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Abstract  of  Dissertation  Presented  to  the  Graduate  Council 
of  the  University  of  Florida  in  Partial  Fulfillment  of  the 
Requirements  for  the  Degree  of  Doctor  of  Philosophy 

FEATURE  EXTRACTION  AND  PATTERN  RECOGNITION 
FOR  REAL-TIME  EEG  PROCESSING 

By 

Teera  Achariyapaopan 
December  1983 

Chairman:     D.  G.  Childers 

Major  Department:     Electrical  Engineering 

The  application  of  pattern  recognition  techniques   for  classifying 

single    EEG    (Electroencephalographic)     records     is    considered.  This 

technique  allows  the  processing  of  EEG  signals  in  real-time.    However,  a 

problem  arises   in   applying   this    technique.      There   are   many  potential 

features  of   the  EEG  signal  and  only  a  finite,  usually  small,  number  of 

training   samples   are   available.      It   is  well  known   that  when  a  finite 

number   of    training   samples    are   available   for   designing   a  classifier, 

there    exists    an    optimal    number    of    features    which    can    be    used  to 

represent  the  data.    This  has  led  to  the  investigation  of  the  effects  of 

a   finite   number   of   training  samples   on   feature  extraction  algorithms. 

We  show  that  for  a  two-class,  Gaussian  data  set  with  a  common  covariance 

matrix   for   each   class,    that    additional    features    can   be   added   to  the 

feature  vector  one  at  a  time  if  the  value  of  an  expression,  which  is  a 

V 


function  of  the  total  number  of  training  samples,  number  of  features, 
and  accumulated  Mahalanobis  distance,  increases  with  each  feature.  For 
a  fixed  number  of  training  samples,  an  equal  training  sample  size  for 
each  class  is  suggested.  An  algorithm  for  determining  the  optimal 
number  of  features  and  the  associated  features  is  proposed.  This 
algorithm  is  then  applied  to  select  features  from  EEG  records,  each  of 
which  consists  of  single  evoked  responses  elicited  from  human  subjects 
who  read  a  series  of  statements.  The  subjects  learned  the  truthfulness 
or  falseness  of  each  statement.  The  EEG  records  collected  for  the  true 
and  false  statements  make  up  our  two  class  data  base.  A  classifier  is 
designed  to  assign  these  evoked  event-related  potentials  (ERP's)  to  one 
class  or  the  other  (true  or  false)  based  on  the  features  selected  to 
represent  the  ERP's  for  each  class.  An  "optimal"  feature  selection 
algorithm  is  described  and  compared  with  other  feature  selection 
techniques . 
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CHAPTER  1 
INTRODUCTION 


The  electroencephalographic  (EEC)  signal  measured  from  the  scalp  is 
a  fluctuating  electrical  potential  that  reflects  neuronal  activity  in 
the  underlying  brain  structures.  Most  of  the  signal  energy  is  below 
50  Hz.  The  amplitude  is  typically  5  to  50  yv.  Event-related  potentials 
(ERP's)  are  a  perturbation  of  the  on-going  EEG  elicited  by  a  single 
application  of  a  sensory  stimulus.  The  ERP's  take  place  after  some 
delay  following  the  stimulus  and  last  for  about  a  second.  The  magnitude 
of  the  ERP's  within  the  EEG  are  very  small,  typically  1-20  yv.  Since 
the  ERP  amplitude  is  very  small,  it  is  difficult  to  detect  the  ERP 
signal  buried  in  the  on-going  EEG,  which  is  considered  as  noise  in  this 
case.  ERP's  are  usually  detected  by  averaging  the  EEGs  for  repeated 
stimulus  presentation.  But  for  some  experiments,  this  detection  problem 
is  aggravated  when  only  a  limited  number  of  replications  of  the  evoking 
stimulus  are  available.  In  the  past,  ERP  analysis  has  been  done  almost 
exclusively  by  signal  averaging  [1-3].  Clearly,  if  the  ERP  signal  is 
deterministic,  then  averaging  would  be  a  good  way  to  detect  the 
signal.  However,  the  ERP  can  be  considered  deterministic  only  as  a 
rough  approximation.  Thus,  there  exists  a  need  for  more  sophisticated 
techniques.  In  recent  years,  statistical  pattern  recognition  and 
discriminant    analysis    techniques   have   been   applied   to   process  single 
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ERP's  [4-12],  This  technique  enables  one  to  analyze  ERP's  in  real-time, 
which  is  useful  for  on-line  experiments. 

Pattern  recognition  is  the  process  of  classifying  a  measurement 
data  set  into  mutually  exclusive  classes  based  on  rules  derived  from  a 
previously  obtained  training  data  set.  Some  authors  [13-16]  also 
include  cluster  analysis,  a  closely  related  subject,  as  part  of  pattern 
recognition.  As  in  any  pattern  recognition  system,  the  system  for 
classifying  ERP's  can  be  considered  as  composed  of  two  parts.  These 
parts  are  feature  extractor  and  classifier  as  shown  in  Figure  1.1. 
Usually  the  number  of  dimensions  for  the  measurement  vector  x  is 
large.  This  is  especially  true  for  the  EBIP  signal.  A  large  dimensional 
vector  will  complicate  the  decision  algorithm.  Moreover,  it  is  well 
known  [14,  17-20]  that  when  a  finite  number  of  training  samples  is 
available  for  designing  a  classifier,  there  exists  an  optimal  number  of 
features  for  describing  the  data.  Hence,  there  is  a  need  to  reduce  the 
number  of  dimensions  of  the  measurement  vector  while  trying  to  retain  as 
much  class  discriminatory  ability  as  possible.  The  relationship  between 
training  sample  size  and  the  optimal  number  of  features  to  be  used  has 
been  studied  [21-28].  Roucos  and  Childers  [27-28]  have  given  an 
analytical  result  of  the  relationship  for  these  quantities  for  Gaussian 
data  with  known  equal  covariance  matrices  for  each  of  two  classes.  Jain 
and  Waller  [24],  El-Sheikh  and  Wacker  [25],  and  Raudys  and  Pikelis  [26] 
gave  a  similar  result  but  for  the  unknown  covariance  matrix  case. 
However,  their  formulae  are  very  complicated  and  are  difficult  to 
apply.        The    contribution    of     this    study    is     to    derive    a  simpler 
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Figure  1.1      Conceptual  recognition  system. 
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expression.  We  also  emphasize  the  importance  of  feature  ordering  and 
propose  an  algorithm  for  selecting  features. 

The  second  block  in  Figure  1.1  is  the  classifier.  The  classifier 
or  decision  rule  partitions  the  feature  space  into  c  classes.  Since  it 
has  been  applied  to  many  areas,  for  example,  communications  [29,30]  and 
statistics  [31-35],  its  theory  is  well  established.  From  Figure  1.1, 
one  might  consider  each  block  as  independent,  but  actually  each  block  is 
dependent.  For  example,  one  feature  extractor  method  may  be  optimal 
only  for  one  type  of  classifier.  However,  little  is  known  about  how 
these  blocks  interact. 

The  next  three  chapters  discuss  the  design  of  the  classifier. 
Chapter  2  considers  the  Bayesian  classifier  which  depends  on  a  complete 
knowledge     of     the     underlying     probabilities.  In     most  practical 

situations,  such  a  complete  knowledge  is  not  available  and  parameters 
have  to  be  estimated  from  the  training  samples.  This  is  the  topic  of 
Chapter  3.  In  Chapter  4,  a  different  approach  for  designing  a 
classifier  is  considered.  This  approach  assumes  a  form  for  the 
classifier  first  and  then  estimates  parameters  of  the  classifier  from 
the  training  samples.  Only  a  linear  model  of  this  type  of  classifier  is 
discussed.  Chapter  5  reviews  some  of  the  existing  feature  selection  and 
extraction  techniques.  The  performance  evaluation  of  a  classifier  is 
discussed  in  Chapter  6.  Chapter  7  discusses  the  effects  of  a  finite 
sample  size  on  the  performance  of  Gaussian  classifiers.  The  optimal 
number  of  features  and  a  feature  selection  algorithm  are  given  in  this 
chapter.      These   techniques   are   then  applied  to  ERP  data  in  Chapter  8. 
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Concluding  remarks  are  provided  in  Chapter  9.  Note  that  the  pattern 
recognition  theory  discussed  here  is  far  from  complete.  This  subject 
can  be  found  in  a  number  of  books  [13-15,  36-47].  Only  the  techniques 
believed  to  be  useful  for  processing  ERP's  are  presented  in  this 
dissertation. 


CHAPTER  2 
BAYESIAN  DECISION  THEORY 


Bayesian  decision  theory  is  a  fundamental  statistical  approach  to 
the  classification  problem,  which  is  based  on  the  assumption  that  the 
decision  problem  is  posed  in  probabilistic  terms  with  a  complete 
knowledge  of  all  relevant  probabilities.  In  this  chapter  we  discuss  the 
fundamentals  of  this  theory  and  study  a  special  case  where  the 
distributions  are  normal. 

2.1     Bayes'  Rule  for  Minimum  Risk 

Suppose  that  there  are  c  possible  pattern  classes,   ojj^ ,   oj^,  , 

and  an  arbitrary  pattern  belongs  to  class   o)^  with  a  priori  probability 
c 

P^,  P     >  0  and         P.  =  1.     The  patterns   are  defined   to  be  d-component 
i=l  ^ 

measurement  vectors  or  feature  vectors.  Thus  an  arbitrary  pattern  can 
be  represented  by  a  d-dimensional  vector,  x^.  Pattern  _x  is  assumed  to  be 
a  random  vector  assuming  a  value  in  d-dimensional  feature  space.  The 
vector    is    described    by    a    multivariate    probability    density  function 

p(2lI'^^),  where  pattern  _x_  is  known  to  belong  to  class  (o^,   i=l,  ,  c. 

Let  aj(20    be   some   decision   rule,    i.e.,    a  function  of  _x_  that   tells  us 
which   decision   to   make    for   every   possible   pattern   x_.      For  example, 
a)(x)  =  0)^  denotes   the   decision   to  assign  pattern  jc  to   class   oo^.  Let 
^('^^I'^j)        the  loss  incurred  when  the  decision  ui(x)  =  o)^  is  made,  when 
in  fact  the  pattern  belongs  to  ti)^ ,  i=l,  ,  c,  j=l,  ,  c. 
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Consider  the  problem  of  classifying  an  arbitrary  pattern  x  of 
unknovm  class.  We  must  decide  to  which  class  x  belongs.  The 
probability    of    x_   belonging    to    class  is    the    class    a  posteriori 

probability  P(a)j|_x).    This  probability  can  be  calculated  by  Bayes'  rule 

p(xh.)P. 

P(^j|x)=-^(^  (2.1) 


where 


p(x)  =    I    p(x|a).  )P.  (2.2) 
j=l  ^  ^ 

is  the  unconditional  probability  function  of  _x.  Now  let  us  consider  the 
loss  associated  with  a  particular  decision.  If  we  assign  x_  to  class  i, 
i.e.  oi(x)  =  (jj^,  we  will  incur  the  loss  of  X(  a)_j^  |  oj^ )  with  probability 
P(Wjl2L)«  Consequently,  the  expected  loss  or  conditional  risk  of  making 
a  decision  oJix)  =  o)^  is 

i  ^ 

1  (x)  =     I    X(a).  |a).)P(a).  |x)       .  (2.3) 
j=l        ^    ^  ^ 

Note  that  the  decision  rule  tti{x)  is  a  function  which  tells  us  which 
decision  to  make  for  every  possible  pattern  _x.  Then  the  conditional 
risk  associated  with  this  decision  rule,   oj(x),  is 


l(x)  =     I    X(aj(x)  I  (o.)P((u.  |x)  (2.4) 
j=l  ^  ^ 
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and  the  average  risk  is  given  by 


L  =  l(x)  =  /  l(x)p(x)dx  (2.5) 

where  the  integral  extends  over  the  entire  feature  space. 

Our  problem  is  to  find  a  decision  rule  wCjO  which  minimizes  the 
average  risk  in  (2.5).  Clearly,  this  goal  can  be  achieved  by  taking 
w(.x)  so  that  1(20  in  (2.4)  is  as  small  as  possible  for  every  x.  In 
other  words  Bayes'  decision  mile  a)(x)  can  be  written  as 


a;*(x)  =  OJ^  if  l^(x)  <  l^(x),  j=l,  ,  c      .  (2.6) 


In  case  of  a  tie,  any  convenient  tie-breaking  rule  can  be  used.  The 
associated  minimum  conditional  risk  is 


l*(x)  =      rain  l^(x) 
i=l,  ,c 


c 

min  I    X(uj.  I  (D.  )P((D.  |x)  (2.7) 

i=l,  ,c    j=l        ^    J  J 


and  the  associated  minimum  average  risk  or  Bayes'   risk  is 


L*        =    /  l*(x)p(x)dx      .  (2.8) 
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Note  that  under  these  conditions,  no  other  decision  rule  can  yield  a 
smaller  risk. 


So  far  our  discussion  has  remained  very  general.  In  this  section  we 
consider  a  particular  loss  function.  In  many  problems  errors  need  to  be 
avoided  and  all  errors  are  equally  costly.  Thus  a  minimum  error  rate 
decision  rule  may  be  desired.  A  loss  function  associated  with  this  type 
of  problem  has  the  form 


This  loss  function  assigns  no  loss  to  a  correct  decision  and  assigns  a 
unit  loss  to  any  error.  With  this  loss  function  the  conditional  risk 
of  (2.3)  is  precisely  the  conditional  probability  of 
classification  error  associated  with  the  decision  oiCjc^)  =  oj^,  and  is 
equal  to 


2.2    Bayes'  Rule  for  Minimum  Error  Rate 


X(a)  I  03.)  =  0    if    i  =  j 


(2.9) 


=  1    if    i  j 


ij  =  1,  ,c 


l^x)  =  e\x)  =    I    P(co.  |x) 
i=l  J 


c 


(2.10) 


=  1  -  P(oj^|x),  i=l,  ,c 


Bayes'  decision  rule  in  (2.6)  becomes 


aJ*(x)  =  0)^  if  e  (x)  <  e^(x),  j  =  1,  ,c  . 


(2.11) 
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Decision  rule  (2.11)  can  be  written  in  a  more  familiar  form  using  the  a 
posteriori  probabilities,  giving 

=  0)^.  if  P((..|x)  =  '^'^  P^-^jli)     .  (2.12) 

-         1  i'-       j  =  l,  ,c 

Note  that  from  (2.12),  to  minimize  the  error  rate,  Bayes'  rule  suggests 
selecting  i  to  maximize  the  a  posteriori  probability  P((D^|2c).  The 
associated  conditional  probability  of  classification  error  and  error 
rate  are 


e*(3c)  =  min  e'''(x) 
i=l,  ,c 


=  1  -  max  P((o.  |x)  (2.13) 

i=l,  ,c  ^ 


E*  =     /e*(x)p(x)dx  (2.14) 

where  the  integral  extends  over  the  entire  feature  space. 

2.3    Error  Probabilities  and  Integrals 
Bayes'    decision    rule    (2.12)    can    be    considered    as    a   device  for 

partitioning    the    feature   space   into   c   regions   R^,    i=l,   ,    c.  The 

region  R.  consists  of  those  samples  which  are  classified  into  class  w. 
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=  {x|P((^^|x)  =  max  P(a).  |x)}      •  (2.15) 


j  =  l,  ,c  J 


Thus  the  error  rate  (2.14)  can  be  written  as 


*  ^ 

E    =    I    E*^  (2.16) 
i=l 


where 


E*^  =    /    e*(x)p(x)dx  (2.17) 

R. 

1 

g*"*"     is     the     probability     of     error     associated     with     the  decision 

a;*(x)  =  0).. 
-  1 

To  obtain  additional  insight,   let  us  consider  the  two-class  case. 
Bayes'  decision  rule  partitions  the  feature  space  into  two  regions 


R^  =  {x|P(a)Jx)  >  P('^2l-^^  (2.18a) 


R^  =  {x|P(a)2|x)  >  P(ojJx)>      •  (2.18b) 

These  regions  are  illustrated  in  Figure  2.1  for  a  one  dimensional 
case.    The  error  rate  in  this  case  is 


1  2 
E*  =  E*    +  E* 
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Figure  2.1      Components  of  the  probability  of  error. 
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(2.19a) 


E*  =  P(xeR^,a)2)  +  PCxgR^.oj^) 


(2.19b) 


E*  is  the  entire  shaded  area  in  Figure  2.1.  Note  that  if  the  decision 
line  is  drawn  elsewhere,  the  error  rate  is  always  greater  than  E*.  This 
shows  that  Bayes'  decision  rule  yields  the  lowest  error  rate. 

2.4  The  Two-Class  Case  with  Multivariate  Normal  Distributions 
Our  discussion  so  far  has  dealt  with  arbitrary  probability 
densities.  In  this  section  we  study  the  important  case  of  normal 
distributions.  We  restrict  ourselves  to  the  two-class  case.  The  normal 
or  Gaussian  probability  density  function  is  important  because  of  its 
computational  simplicity  and  because  it  represents  a  realistic  model  for 
many  pattern  classification  situations. 

When  there  are  only  two  pattern  classes,   the  decision  rule  (2.12) 
becomes 


0J*(x)  =  0)^  if  P(a)^|x)  >  P(a)2|x) 
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oj^  otherwise  (2.20) 


Alternatively,  the  decision  rule  of  (2.20)  can  be  written  in  terms  of 
the  likelihood  ratio  (ji^) / p(x\ (li^) 

p(x|aj  )      P(a)  ) 

(2.21) 

=  oj^  otherwise. 

Now   let   us    assume    that    each   class    is    normally   distributed  with 
means  jj^    and    covariance   matrices  i.e.,  p(2c|  to^)   "  N(p.,  1^) '  More 

specifically, 

p(x|ai^)  =  (2Tr)"P/2|   j;j-l/2e^p[_  i/2(x  -  u.)^  I^^x  "  U.)] 

(2.22) 

where  (x  -  Ji^^)*"  is    the    transpose    of  (jc  -  Ji^)  »     1-^  is    the    inverse  of 
^i'     I   Ij^l  is    the   determinant    of         ,  and   p    is    the    dimension   of  _x. 
With  this  density,  the  likelihood  ratio  p(x|  ajj^)/p^x|  ^2)  becomes 

P(xhi)       I   IJ"^/^exp[-  I  (X  -  y^)^  U-up] 

p(x|oo„)  "   ,   Y  1-1/2      ,     I   .  ^t  v-W  M       *  ^^'^^^ 

-    2        I   ^2 1        exp[-  -  (x  -  y^)  (x-Ji2)] 

Since  the  logarithm  is  a  monotonically  increasing  function,  we  can  also 
consider  the  logarithm  of  the  likelihood  ratio  D(x)  which  is 
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p(x I  to  ) 

D(X)     =       log   -7—1  N 


=  -  y  (x  -  Ji^)*^  l^^  (x  -  v^)  +  |<x  -  ji^)^  "  Ji2) 


+  log 


1 1, 


1-1/2  • 


(2.24) 


Consequently,  the  decision  rule  (2.21)  can  be  written  as 


0J*(x)  =  ojj^  if  D(x)  >  log  -  log  P(oj^) 


=  0)^  otherwise  (2.25) 


D(x)  is  sometimes  known  as  a  discriminant  function.  Now  let  us  examine 
three  special  cases. 

r  2 
Case  1 .     > .  =  a  I 

This    simple    case    occurs    when    all    features    are  statistically 

independent  and  have  the  same  variance.     Geometrically,  this  corresponds 

to    the    situation    in    which    the    samples    fall    into    two    equal  size 

hyperspherical   clusters,    with   the   cluster   for   class  tu^  being  centered 

"1  2 

about  the  mean  vector  ji^.     Substituting         =  1/a  I  in  (2.24),  we  obtain 
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D(x)  =  -  -^•  [  -  " 

2a  ^ 


(2.26) 


where 


2  t 
llx  -      U    =  (x  -  u.)  (x  -  y.) 


II  •  II    denotes  the  Euclidean  distance. 

If  the  a  priori  probabilities  P(oj^)  are  the  same  for  both  classes, 

2 

then  log[^(a)2)/P((i)j^)]  =  0  and  the  constant  l/2a  in  (2.26)  becomes 
unimportant  and  can  be  ignored.  In  this  case  the  decision  rule  in 
(2.25)  simply  becomes 


The  classifier  that  uses  the  decision  rule  in  (2.27)  is  known  as  a 
minimum  distance  classifier.  The  means  for  each  class  are  thought  of  as 
an  ideal  prototype  or  template.  An  unknown  pattern  _x  is  decided  as 
belonging  to  class  o)^  if  the  Euclidean  distance  between  x_  and  jj^  is 
smallest  of  all  such  distances.  This  is  essentially  a  template  matching 
procedure. 

If  the  a  priori  probabilities  are  not  equal,  then  the  decision  rule 
in  (2.25)  becomes 


aj*(x)  =  u)    if  D(x)  =  [llx  -  jj„ll^  -  llx  -  u,  11^]  >  0 


=  oo„  otherwise. 


X 


(2.27) 
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w*(x)  =        if  D(x)  >  0 


=  w  otherwise 


(2.28a) 


where 


llx  -  y.^  11^  -  logP(a)^)]  . 


(2.28b) 


Note   that   the   distance   is   normalized   by   the  variance  a  and   is  reduced 
log  P((»J^).      This    can    be    considered    a    generalized    minimum  distance 
classifier. 

By  noting  that  x^  =  JJ^'x,  the  discriminant  function  (2.28b)   can  be 
written  as 


which  is  a  linear  discriminant  function.  The  classifier  that  uses  such 
a  linear  discriminant  function  is  called  a  linear  machine. 


In    this    case    both    classes    have    an   identical    covariance  matrix. 
Geometrically,    this   corresponds   to   the   situation   in  which   the  samples 


D(x)  =  -\  [2x^(u,  -  uj  -  /  y 
2  a  i        ^  i 


(2.29) 


Case  2.  >. 

^i 
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fall  in  hyperellipsoldal  clusters  of  equal  size  and  shape,  the  cluster 
for  the  class  o)^  being  centered  about  the  mean  vector  jj^.  The  decision 
rule  for  this  case  is 


aJ*(x)  =  0)^  if  D(x)  >  0 


=  oj^  otherwise  (2.30a) 

where 


D(x)  =  [■kx-Ji,)''  r^x  -  ji,)  -  log  P(aj„)]  - 


i\  (x  -  Jii)''l^^x  -  Ji^)  -  log  P(a)^)]     .  (2.30b) 


Note  the  similarity  between  (2.30)  and  (2.28).    Again  D(jO  in  (2.30)  can 

be   considered   as   a   generalized  minimum  distance   discriminant  function 

t  r-1 

where  in  this  case  the  distance  is  (x  -  _u^)  i  (x  -  V^)  which  is  known 
as  the  Mahalanobis  distance.  The  discriminant  function  in  (2.30)  can  be 
written  as  a  linear  function  of  x^  similar  to  (2.29). 

D(x)  =  x*"  r^Jdi  -  Ji2)  -  i  (Jii  +  -H2^''  ^'^^^1  "  -^2^ 


+  log  P(a)p  -  log  P  ((0^)  (2.31) 
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when  the  a  priori  probabilities  P(co^)  are  identical  for  both  classes, 
the  last  two  terms  in  (2.31)  cancel  out,  and  becomes 

D(x)  =  (x^  -  I  (y^  +  ii2)*')  (Jii"  Ji2)  (2.32) 

which  is  known  as  the  Anderson  linear  discriminant  plane.  We  will  see 
in  a  later  chapter  that  this  discriminant  function  is  also  equivalent  to 
the  Fisher  linear  discrimination  function. 

Case  3.       1^  arbitrary. 

The  decision  rule  for  this  general  case  is  (2.24)  and  (2.25).  It 
is  quadratic  in  nature  and  cannot  be  reduced.  The  decision  surfaces  are 
hyperquadrics ,  and  can  assume  any  of  the  general  forms  -  pairs  of 
hyperplanes,  hyperspheres,  hyperellipsoids,  hyperparaboloids,  or 
hyperhyperboloids  of  various  types. 


CHAPTER  3 
PARAMETER  ESTIMATION 


In  Chapter  2  we  considered  the  problem  of  designing  an  optimal 
classifier  where  the  a  priori  probabilities  P(oj^)  and  the  class- 
conditional  densities  p(x|  u)^)  were  completely  known.  Unfortunately,  in 
pattern  recognition  applications  this  is  usually  not  the  case.  In  a 
typical  situation  we  may  have  some  vague,  general  knowledge  about  the 
problem  together  with  a  number  of  design  samples.  Then  the  problem  is 
to  find  some  way  to  use  this  information  to  design  a  classifier. 

One  approach  to  this  problem  is  to  use  the  training  (design) 
samples  to  estimate  the  unknown  probabilities  and  probability 
densities.  The  resulting  estimates  are  used  as  if  they  were  the  true 
values.  "In  a  typical  situation,  the  estimation  of  the  a  priori 
probabilities  presents  no  serious  problem.  However,  estimation  of  the 
class-conditional  densities  is  not  so  simple.  The  number  of  training- 
samples  always  seems  too  small  and  a  serious  problem  arises  when  the 
dimensionality  of  the  feature  vector  x_  is  large.  In  this  chapter  we 
di  scuss  how  to  estimate  these  class— conditional  densities.  We  postpone 
the  discussion  of  the  effects  of  finite  training  sample  size  until 
Chapter  7. 

For  the  estimation  procedures  discussed  here,  we  assume  that  the 
general  knowledge  about  the  problem  permits  us  to  assume  a  form  for  the 
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density  p(  •  ;  _6)  except  that  it  contains  unknown  parameters  _6  (if  9 
were  known,  the  density  function  would  be  completely  specified).  This 
problem  is  known  in  statistics  as  parametric  point  estimation  [48]. 
Other  methods  exist,  which  do  not  require  knowledge  about  the  form  of 
density.  These  methods  are  known  as  nonparametric  methods  [14,  49, 
50].  In  general,  the  nonparametric  methods  require  a  large  number  of 
training  samples.  For  the  problem  of  classifying  EEG  data,  a  large 
number  of  training  samples  are  usually  not  available.  Thus, 
nonparametric  estimation  will  not  be  treated  here. 

The  parametric  point  estimation  problem  can  be  approached  in 
several  ways.  We  shall  consider  in  the  next  two  sections  two  common 
procedures,  maximum  likelihood  estimation  and  Bayesian  estimation 
respectively.  To  simplify  our  treatment,  we  shall  assume  that  training 
samples  from  class  oj^  give  no  information  about  J_.  if  i^j.  That  is 
equivalent  to  assuming  that  the  parameters  for  the  different  classes  are 
functionally  independent.  This  permits  us  to  work  with  each  class 
separately  and  to  simplify  our  notation  by  not  having  to  indicate  the 
class  distinction.  Thus  we  now  have  c  separate  problems  for  the  c-class 
problem.      Each  separate  problem  is   the   following  form.      Use  a  set  of 

samples,   x  =  {x,  ,  x^}  drawn       independently       according       to  the 

probability  law  p(2c  ;  _e)  to  estimate  the  unknown  parameter  vector  _e.  _9 
is  written  explicitly  to  indicate  the  dependence  of  the  density  p{x)  on 
the  parameter  vector  9. 
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3.1     Maximum  Likelihood  Estimation 
Maximum  likelihood  methods  view  the  parameters  as  quantities  whose 
values   are  fixed  but  unknown.     The  best  estimate  is  defined  to  be  the 
one    that   maximizes    the   probability   of    obtaining   the   samples  actually 

observed.     Suppose  that  x  contains  n  samples,   x  =  {x^,  ,  Since 

the  samples  were  drawn  independently,  then 

n 

p(x  ;  _9)  =    n    p(x,    ;  i)  (3.1) 
k=l 

p(x  ;   B)  is    called    the    likelihood    of   _9  with   respect    to    the   set  of 

samples.     The  maximum  likelihood  estimate  of  _£  is,   by  definition,  the 

value  9  that  maximizes  p(x  ;  9).  In  some  sense,  this  value  corresponds 
to  the  value  of  _9  that  best  agrees  with  the  actually  observed  samples. 

A 

Since  the  logarithm  is  monotonically  increasing,  the  _e  that 
maximizes  the  log-likelihood  also  maximizes  the  likelihood.  Also,  it  is 
usually  easier  to  work  with  the  logarithm  of  the  likelihood  than  with 
the  likelihood  itself.    Let  l(^)  be  the  log-likelihood  function,  then 

1(1)  =  log  p(x  ;  _9)       .  (3.2) 

Substitute  (3.1)  in  (3. 2), then 
n 

1(9)  =    I    log  p(x^  ;  _9)       .  (3.3) 
~       k=l  ~ 

If  1(0)  is      a     dif f erentiable     function     of  _9,   Q_  can     be      found  by 
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differentiate  l(_e)  with  respect  to  _6  and  equating  this  derivative  to  0. 

Let  _e   be    the    p-component    vector  ^=  (6   ,   6        and    let  V    be  the 

1  p  6 

gradient  operator,  then  with 


3e, 


30 


(3.4) 


we  have 


n 


^qK  9)  =  I  V  log  p(x,  ;  _e) 
-  k=l      -  ^ 


(3.5) 


The  vector  _6  can  be  found  by  solving  (3.5). 

As  an  example,  suppose  that  the  samples  x  =  {x, ,   ,  x  }  are  drawn 

I  n 

from   a   normal   population   with   mean    u  and   variance    a     (Consider  the 

2 

univariate   case   for   simplicity).      That    is   denoted   as  x  ~  N( y ,  a  ) .  We 

know   that    x  is   normally   distributed   and   samples  x  =  {x,  ,  ,  x  }  are 

1  n 

available.     The  problem  is  to  find  the  maximum  likelihood  estimate  of  u 

and  0^.     (If  we  know  the  parameters  u  and  o^,   then,   of  course,  p(x)  is 

2 

completely  known).    Let  9^  =  y  and  9^  =  a  .  Here 


log  p(xj^  ;  i)  =  -  I  log  2Tr92  "  Yq"  (x^  -  9^' 


and 


Vq  log  p(Xj^  ;  B) 


^  (x    -  9  ) 

02      k  1 


1  ,  <^  -  V 
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Then,  Eq.  (3.5)  leads  to  the  conditions 


n 

I 

k=l  0, 


V  1 


2 
and 

n      ,        n      (x    -  9  )^ 

k=i  k=i 

A  A 

where  9^^  and  9^  are    the    maximum    likelihood    estimates    for    9j^    and  02 

A  A  A  2  A 

respectively.  By  substituting  y  =  9^ ,  a  =  9^  and  rearranging,  we 
obtain  the  maximum  likelihood  estimates  for  y  and  o^;  denoted  as 


1  " 

U=-    I  (3.6) 
"  k=l  " 


^9        1      ^  '^7 

n    ^    ^^k  -  •  (3.7) 

k=l 


Similarly  for  the  multivariate  case,   one  can  show  that  the  maximum 
likelihood  estimates  for  jj  and     J  are  given  by 


Ji  =  -    I  (3.8) 
"  k=l  ^ 


and 


1    "  "  "  t 


(3.9) 
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Note  that  the  maximum  likelihood  estimate  for  the  mean  vector  is 
the  sample  mean.  The  maximum  likelihood  estimate  for  the  covariance 
matrix  is  the  average  of  the  n  matrices  (jc^  -  \0(^  ~  Ji)  •  This  is  a 
quite  satisfying  result  since  the  true  covariance  matrix  is  the  expected 
value  of  the  matrix  (x  -  jj)(x  -  jj)^.  However,  it  is  well  known  that  the 
maximum  likelihood  estimate  for  a  covariance  matrix  is  biased;  that  is, 
the  expected  value  of  ^  is  not  equal  to  J^.  An  unbiased  estimate  for  ^ 
is  given  by  the  sample  covariance  matrix 

^    =^    I  ii)(25fe  -  Ji)       .  (3.10) 

k=l 

3.2    Bayesian  Estimation 
The  basic  assumptions  for  Bayesian  estimation  are: 

(1)  The  form  of  the  density  p(2c|_9)  is  assumed  to  be  known,  but  the 
value  of  the  parameter  vector  _e  is  not  known. 

(2)  The  parameter  vector  _9  is  assumed  to  be  a  random  variable.  Our 
initial  knowledge  about  _6  is  assumed  to  be  contained  in  a  known 
a  priori  density  p(_8). 

(3)  The  rest  of  our  knowledge  about  _e  is  contained  in  a  set  x  of  n 

samples    x^,   ,  drawn    independently    according    to  the 

unknown  probability  law  . 

The  problem  is   to  compute  p(x | x) ,  yielding  the   best  approximation 
of  .     Note  that   the  difference  between  the  Bayesian  method  and  the 

maximum  likelihood  method  is  the  second  assumption.    The  Bayesian  method 
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views    the   parameter   vector  _e   as    a   random   variable    having   a   known  a 
priori  density  p(_e).     The  maximum  likelihood  method  views  the  parameter 
vector  _e  as  a  quantity  whose  value  is  fixed  but  unknown. 
To  obtain  p{x\  x)  note  that 

P(2ih)  =    hU,±\x)dQ  (3.11) 

where  the  integral  extends  over  the  entire  parameter  space.  Now  we  can 
write 

p(x,_e|x)  =  p(x|^,x)p(^|  X)     .  (3.12) 

Since  x_  and  x  are  independent,  that  is  the  distribution  of  _x  is 
completely  known  once  we  know  the  value  of  _9.  Then  the  first  factor  in 
the  right  side  of  (3.12)  is  merely  p(x|_9).    Thus  from  (3.11)  and  (3.12) 

p(x|x)  =    /p(x|_9)  p(_6|x)d_9      .  (3.13) 

This  equation  links  the  desired  density  p(x| x)  to  the  a  posteriori 
density  p(_9| x)  for  the  unknown  parameter  vector.  The  only  unknown  in 
(3.13)  is  the  a  posteriori  density  p(_9|x).  To  calculate  p(_9|x),  by 
Bayes '  rule 


p(x|i)p(l) 

p(_9|x)=-  .   (3^14) 
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and  by  the  independence  assumption 


n 

p(x|i)  =    n     pU\B)      .  (3.15) 
k-1  ^ 


This    constitutes    the    formal    solution   to    the   problem.      We    show  its 

relation  to  the  maximum  likelihood  solution.     Suppose  that  p(x|e)  has  a 

sharp  peak  at  _e  =  _9.    If  the  a  priori  density  p(_e)  is  not  zero  at  e  =  9 

and      does      not      change      much      in      the      surrounding  neighborhood, 

then  p(_e|  x)  also    peaks    at  _e  =  _e.      Thus    (3.13)    shows    that  p(x|  x)  is 

approximately  p(x|_9),  a  result   that   one  would  obtain  using  the  maximum 

likelihood  estimate  as  if  it  were  the  true  value. 

To  illustrate   the  method,    let  us   consider   the  normal  case.  For 

simplicity,   we  will  consider  the  univariate  case.     Thus   the  density  is 
2 

p(x|p)  ~  N(y,a  ).  Suppose  that  the  mean  \i  is  the  only  unknown 
parameter  ( is  known) .  We  assume  that  whatever  prior  knowledge  we 
might  have  about  u  can  be  expressed  by  a  known  a  priori  density  p(y)  and 
we  shall  further  assume  that 

p(y)  -  N(u^,aJ  (3.16) 

2 

where    both  u    and  a    are    known.       Generally    speaking,  u    is    our  best 
o  o  o 

2 

initial  guess  for  u,  and  is  the  measure  of  the  uncertainty  of  this 
guess.  The  assumption  that  the  a  priori  distribution  for  y  is  normal 
will  simplify  the  problem.  However,  the  important  point  is  not  that  the 
a    priori    distribution    for    p    is    normal,    but    that    it    exists    and  is 
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known.     Suppose  now  that  n  samples       ,  ,        are  independently  drawn 

from  the  resulting  population.     Let  x  =  {x^ ,   ,  x^},  by  Bayes'  rule  in 

(3.14),  we  obtain 

p(y|x)  =  P(Ht^)p(t^) 
/p(x|  u)p(y)dp 

n  (3.17) 

=  a    n     p(x  |u)p(u) 
k=l  ^ 

where    a  is   a  scale   factor   that   depends  on  x  but   is   independent   of  u. 

Equation  (3.17)  shows  how  the  observation  of  a  set  of  samples  affect  our 

ideas  about  the  true  value  of  \i,  changing  the  a  priori  density  p(ij)  into 

an       a       posteriori       density  p(  y|  x) .  Since  p(x|u)  "Ndi.a^)  and 
p(y)  "  N(ii^,a^), 


p(w|x)  = 


n 

a  n 
k=l 


/Tir 


exp  [-  }  (^)] 


1 

/  2tt  0 


exp  [-  -  (- 


y-u. 


-)] 


=  a'  exp[-  k  I 


2  2 


k=l 


(3.18) 


where  a'  and  a"  are   factors   that  do  not  depend  on  y.     p(u|x)  in  (3.18) 
is    still    a    normal    distribution    for    any    number    for    samples,    n,  and 
p(u|x)  is     said     to     be     a     reproducing     density.         We     can  write 
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2 

p(u|><)  "  N(y^,a^)  where 
2 

na  2 
=      2  I    2  \         2\    2  %  (3.19) 

o  o 

0^  =  — ^  2  (3.20) 


a 


where  m^^  is  the  sample  mean 


"      "  k=l 


Roughly     speaking,        represents     our     best     guess     for     y  after 

2 

observing  n  samples,   and       measures   our  uncertainty  about  this  guess. 
2 

Since  decreases  raonotonically  with  n,  each  additional  observation 
decreases  our  uncertainty  about  the  estimate  of  y.  As  n  increases, 
p(y|x)  becomes  more  and  more  sharply  peaked,  approaching  an  impulse 
function  as  n  approaches  infinity.  This  behavior  is  generally  known  as 
Bayesian  learning  (see  Fig.  3.1). 

It  is  obvious  from  (3.19)  that  y^  is  a  linear  combination  of  the 
sample  mean  m^  and  the  initial  guess  y^,  with  coefficients  that  are 
nonnegative  and  sum  to  one.  Thus,  y^  always  lies  somewhere  between  m^^ 
and  y^.  If  *  0,  y^  approaches  the  sample  mean  m^  as  n  approaches 
infinity.  If  =  0>  we  have  a  degenerate  case  in  which  our  a  priori 
certainty  that  y  =  y^  is  so  strong  that  no  number  of  observations  can 
change    our    opinion.       At    the    other    extreme,     if        »  a,  we    are  so 
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Figure  3.1      Learning  the  mean  of  a  normal  density. 
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uncertain  about  our  a  priori  guess  that  we  take  y    =  m  ,  using  only  the 

n       n  o  J 

samples  to  estimate  u. 

Having  found  the  a  posteriori  density  p(m|x),  we  must  next  find  the 
density  p(x|x).     From  (3.13), 


p(x|  x)  =    /  p(x|  y)p(u|  x)dy 


2  ,  ,   y-p  2 

1  r     1  /  n. 


=  /  — ^  expt-  ]    ~—  exp[-  ^)  ]dy 

n 

f  ^2 

1  1  ^^~^n^ 

=  Tif^^^Pt-  2      2    2  1  ^^^'''n^ 
n  a  +a 

n 


where 


2,2  2  2 

,   a  +0  0  x+o  y  2 

f(a,aj  =    /exp[- ^-y^  (U  -  -^y^)     ]dy  . 

a  a  0+0 
n  n 

Again,  p(x| x)  is  normally  distributed  with  mean  y    and  variance  o  +o 

. .  n  n 


p(x|x)   ~  N(y^,o^  +  0  )       .  (3.22) 


In  summary,  the  density  p(x|x)  whose  parametric  form  is  known  to  be 

p(x|y)  ~  N(y,o^),  we    merely    replace    y   by    y     and    o^   by  o^  +  o^.  In 

n  n 

effect,  the  conditional  mean  y^  is  treated  as  if  it  were  the  true  mean, 
and  the  known  variance  is  increased  to  account  for  the  additional 
uncertainty  in  x  resulting  from  our  lack  of  exact  knowledge  about  the 
mean     y.        From    the     density  p(x|x),  which     is     actually     the  class- 
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conditional  density  p(x  |  oj^ ,  x^) ,  and  together  with  the  a  priori 
probability  P(w^)  we  obtain  the  needed  information  to  design  the  Bayes 
classifier. 


CHAPTER  4 
LINEAR  DISCRIMINANT  FUNCTIONS 


In  Chapter  2  we  assumed  that  the  forms  for  the  underlying 
probability  distributions  were  known  and  used  training  samples  to 
estimate  the  values  of  the  distributions  parameters.  This  method  may 
lead  to  a  complicated  discriminant  function  which  could  be  difficult  to 
implement.  In  this  chapter,  we  take  a  different  approach  by  assuming 
the  form  of  discriminant  function,  and  then  use  the  training  samples  to 
estimate  the  values  of  its  parameters.  We  will  examine  various 
procedures  for  determining  these  parameters.  These  procedures  have  been 
regarded  as  the  learning  method.  None  of  these  procedures  require 
knowledge  of  the  underlying  distributions,  consequently  we  can  select 
any  form  of  the  discriminant  function  we  desire.  One  of  the  most 
popular  discriminant  functions  is  the  linear  function,  because  this 
function  can  be  implemented  easily  and  is  optimal  in  some  cases  if  the 
underlying  distributions  are  of  the  proper  form.  Throughout  this 
chapter  we  will  be  concerned  with  the  linear  discriminant  function 
only.  We  will  also  limit  ourselves  to  the  two-class  case.  The 
multiclass  case  is  a  straightforward  extention  of  the  two-class  case  and 
can  be  found  in  a  number  of  texts  on  pattern  recognition 
[14,37,42,45,47]. 
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4.1  Linear  Discriminant  Function  for  the  Two-Class  Case 
A  linear  discriminant  function  has  a  form 


g(x)  =  w  X    +  w„x_  +           +  w  X    +  w 

—         11        L  L  pp  o 


t  ^ 
W  X  +  w 
 o 


(4.1) 


The  p-dimensional      vector  w  =   '"p^^  called     the  weight 

vector  and  w^  is  the  threshold  weight.  The  decision  rule  corresponding 
to  the  discriminant  function  g(20  is 

A 

u<x)  =  \     if  g(x)  >  0 

(4.2) 

=  a)^  otherwise 

If  g(x)  =  0,  an  arbitrary  class  can  be  assigned.  The  decision  rule 
partitions  the  feature  space  into  two  regions,  one  where  g(x)  >  0  and 
another  where  g(2c)  <  0.  The  decision  boundary  is  determined  by 
g(2c)  =  0.  Since  g(2c)  is  a  linear  function  in  _x,  the  decision  surface 
is  a  p-dimensional  hyperplane  and  is  given  by 

w^x  =  -w      .  (4.3) 

  o 

Let  us  review  some  geometric  properties  of  the  hyperplane.     Let  Xj^ 

and        both  lie  on  the  hyperplane,  then  we  have 
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X,  +  w  =  w^x„  +  w  =0 
—   —1        o  1  o 


Thus 


E^^l  -         =  ^      '  (4.4) 


Since  the  vector  (x^  -  X2)  ^^^^        the  hyperplane,   (4.4)   shows  that  the 

vector        is    normal    to    the    decision    surface    and    thus    defines  the 

orientation  of  the  hyperplane.  The  normal  Euclidean  distance  from  the 
origin  to  the  hyperplane  is 

t 

—   —1  ^o 


With  a  proper  normalization,  the  threshold  weight  w^  defines  the 
location   of    the    hyperplane.      Let   r   be    the    signed,    normal  Euclidean 

distance    from    an    arbitrary    point  jc'    to    the  hyperplane,    where    r  is 

positive  if  g(,^  >  0  and  negative  if  g(_x)  <  0.  Then  r  is  equal  to  (see 
Figure  4.1) 

w*"  w*'x'+w        g(3C ' ) 

r  =nMr(x'  -         =^^  =  -i^    .  (4.6) 

Thus  the  distance  r  is  proportional  to  g(.x) . 

In  short,  the  linear  discriminant  function  partitions  the  feature 
space  by  the  hyperplane  into  two  regions.  The  orientation  of  the 
hyperplane  is  determined  by  the  normal  vector  _w  and  the  location  of  the 
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Figure  4.1      The  geometry  of  the  hyperplane    g(2c)  =  0. 
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hyperplane  is  determined  by  the  threshold  weight  w^.  In  particular,  if 
the  threshold  weight  =  0,  the  hyperplane  passes  through  the  origin. 
The  origin  is  on  the  positive  side  (i.e.,  when  g(x)  >  0  )  if  w^  >  0,  and 
on  the  negative  side  when  gCjO  <  0.  The  discriminant  function  g(x)  is 
proportional  to  the  signed  distance  from  _x  to  the  hyperplane. 

Before  we  consider  methods  for  determining  the  weights  of  the 
discriminant  function,  let  us  introduce  a  convention  for  our  notation. 
Let  ^  denote  the  (p+1)  dimensional  vector 

X  =  [x^ .x^   >^p»l^^ 

(4.7) 

Let  _a  denote  the  (p+1)  dimensional  weight  vector 
a=  [w^,W2,--,Wp,w^]^ 

(4.8) 

=  [w,  w^]^  . 
With  this  notation,  (4.1)  can  be  written  as 

g(x)  =  a^z     '  (^-9) 
The  vector  ^  is  known  as  the  augmented  pattern  vector. 
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^•2  Deterministic  Learning  Algorithm 
Now  we  shall  investigate  some  nonstatistical  methods  for 
determining  the  weight  vector  a.  The  first  method,  originated  by 
Rosenblatt  [51]  is  known  as  the  perceptron  learning  algorithm.  This 
algorithm  assumes  that  the  training  samples  are  linearly  separable.  The 
second  algorithm  to  be  discussed  is  known  as  the  least  mean-squared- 
error  procedure.  The  training  samples  for  the  second  procedure  can  not 
necessarily  be  linearly  separated.  The  training  samples  are  said  to  be 
linearly  separable  if  there  exists  a  linear  classifier  that  can  classify 
them  correctly.  In  other  words  there  exists  a  vector  _a  such  that 


A^Zi  >  0  ei^i  (4.10a) 


a 


"Xl  <  0  ^  ^2     '  (4.10b) 


To  simplify  this  treatment,  let  us  introduce  a  useful  convention. 
Note  that  (4.10b)  can  be  written  as  ^('X^)  >  0,\f^^  eco^.  This  suggests 
a  normalization  procedure  achieved  by  replacing  all  the  samples  from 
class  by  their  negative  values.  With  this  normalization,  we  can 
forget  the  labels  assigned  to  the  training  samples  and  look  for  a  weight 
vector  a_  such  that  a*^^^  >  0.  This  convention  will  be  used  throughout 
this  section. 

Other  deterministic  learning  algorithms  exist  than  the  two  we  shall 
discuss,  e.g.,  the  potential  function  approach,  the  gradient  technique, 
etc.   [14,  45,  47].     However,  we  will  not  discuss  these  here. 
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4.2.1.    The  Perceptron  Learning  Algorithm 

Suppose  that  we  have  a  set  of  n  samples  S    =  ^^1'  '  ^  ^'  so'^is 

labeled  oj^  and    some    labeled  u)^  and  is    normalized,     i.e.  is 

replaced  by  -y^  if  y^  e  0)^.  Assume  that  these  training  samples  are 
linearly  separable,   i.e.,  there  exist  a  weight  vector _a  such  that 

=  f^^Zi  >    0  ^         •  (^-11) 

Our  problem  is  to  find  such  a  vector.  This  problem  is  essentially  one 
of  solving  a  set  of  linear  inequalities,  which  can  be  done  in  many 
ways.  One  such  method  is  an  iterative  procedure,  which  we  now 
describe.  Let  _a_(k.)  denote  the  weight  vector  at  kth  iterative  step  of 
the  algorithm,  and  _^(k)  denote  the  kth  sample  at  kth  step  obtained  by 
scanning  the  sample  set  repeatedly  and  cyclically,  i.e.  ^(k)  =  if 
i  =  k  mod  n. ,  k  =  1,  2,   .     The  algorithm  can  be  written  as 

a(k+l)  =  a(k)  +  c(k)2(k)     if  a'^(k)2(k)  <  0  (4.12a) 

=  a(k)  otherwise  (4.12b) 


where  c(k)  is  the  positive  correction  factor  at  kth  step. 

The  idea  of  this  algorithm  is  to  leave  the  weight  a_  unchanged  if  it 
classifies  the  sample  correctly  and  move  the  weight  vector  in  the 
"right"   direction  when   it   makes   a  mistake.      The  solution  is  obtained 
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when  _a  correctly  classifies  all  n  samples.  Note  that  the  solution  is 
not  unique  and  can  be  any  vector  in  the  solution  region  (see 
Figure  4.2).  The  initial  weight  can  be  assumed  arbitrarily.  To  gain 
some  insight  into  how  the  algorithm  works,  let  us  examine  the  kth 
step.    Assume  belongs  to  oj^     If  _a^(k)2(k)  >  0,  the  weight  vector  a_ 

is  left  unchanged,  otherwise  _a(k+l)  =  _a(k)  +  c(k)^(k).    Note  that 

a)^(k+l)x(k)  =  _a(k)'^2.(^)  +  c(k)^ik)^ik)  >  a^(k)x(k)      .  (4.13) 

Thus  the  weight  adjustment  is  trying  to  correct  the  error,  which  it  may 
or  may  not  correct.  This  is  controlled  by  the  correction  factor  c.  If 
c(k)  is  independent  of  k,  the  procedure  is  called  the  fixed-increment 
rule.  If  c(k)  is  the  smallest  integer  that  makes  _a'(k+l)^(k)  >  0,  i.e. 
the  smallest  integer  greater  than  |_a*'(k)^(k)  |  /  |^''(k)^(k)  |  ,  the  procedure 
is  called  the  absolute  correction  rule.  Another  possibility  is  to  take 
c(k)  as  a  fraction  of  |_a^(k)^(k)  |  /  |2.^(k)^(k)  |  ,  this  procedure  is  known 
as  the  fractional  correction  rule.  It  should  be  noted  that  if  the 
correction  factor  is  too  small,  the  algorithm  will  take  longer  to 
converge.  But  if  the  correction  factor  is  too  large,  the  algorithm  may 
overcorrect.  The  proof  of  the  covergence  of  the  algorithm  can  be  found 
in  [14,45,47,52]. 
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Figure  4.2 


Linearly  separable  samples  and  the  solution  region  in  weight 
space. 
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The  Least  Mean-Squared-Error  Procedure 

The  procedure  in  the  last  subsection  depends  on  the  assumption  that 
training  samples  are  linearly  separable.  If  this  is  not  the  case,  the 
algorithm  will  never  converge.  Unfortunately,  sufficiently  large 
experimental  design  sets  are  almost  certainly  not  linearly  separable. 
We  shall  now  consider  an  algorithm  which  does  not  require  linearly 
separable  training  samples.  This  algorithm  pays  attention  to  all  the 
training  samples,  whereas  the  last  procedure  focussed  only  on  the 
misclassif ied  samples.  Previously  we  have  sought  a  weight  vector  a_y 
making  the  inner  products  >  0.     In  this  section,  we  are  trying  to 

make  ^Jj^  ~  ^-j^  where  the  b^  are  some  arbitrarily  specified  positive 
constants.  Thus,  we  have  replaced  the  problem  of  solving  a  set  of 
linear  inequalities  with  a  problem  of  solving  a  set  of  linear  equations. 

To  ease  the  discussion,  let  us  introduce  a  matrix  notation.  Let  Y 
be  the  n  by  (p+1)  matrix  whose  ith  row  is   the  vector  j^,   and  b_  be  the 

column  vector   (b^ ,  ,    b^)*-.      Then   our   problem   is    to   find   a  weight 

vector  a_  satisfying 

Ya=_b      .  (4.14) 

If  Y  is  a  square  nonsingular  matrix,  we  would  obtain  a_  =  Y~^_b.  However, 
Y  is  usually  a  rectangular  matrix  with  more  rows  than  columns  or,  in 
other  words,  there  are  more  equations  than  unknowns.  The  vector  a_  is 
overdetermined  and  generally  there  is  no  solution.     However,  we  can  seek 
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a  weight  vector  _a  that  minimizes  the  sum  of  squares  of  the  error 
e  =  Ya_  -  _b.    This  is  equivalent  to  minimizing 


Vj(_a)  =  IIYa  -  b_ll^  =     I    {a^^^^  -  b^)^      .  (4.15) 

i=l 


A  solution  can  be  found  by  forming  the  gradient  of  J(^)  and  setting  it 
to  zero,  i.e., 


n 

Vj(a)  =     I    2{a\    -  b  )y. 

i=l  111 


=  2Y'^(Ya  -  b)  =  0  (4.16) 


This  yields  the  condition 


Y*^Ya  =  Y*^b.  (4.17) 


Y^Y  is  usually  a  nonsingular  square  matrix.     Then  a^  can  be  obtained  by 
a  =  (Yhrhh 


=  Y*b  (4.18) 


where 


Y**^  =  (Y^)"^'^ 


is  called  the  pseudoinverse  [53]. 
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It  is  clear  that  the  solution  vector  _a  depends  on  the  margin  vector 
b^.       Different    values    of    b_  give    different    solution    vectors    a_  with 
different  properties.     Thus,   for  an  arbitrary  value  of  vector  _b,  there 
is    no    reason    to    believe    that    the   LMSE    solution   yields   a  separating 
vector   in   the    linear   separable    case.      However,    it   is   reasonable  to 
expect   that   the   solution   obtained   gives   a   reasonable   result   in  both 
separable  and  nonseparable  cases.     To  support  this  expectation,  Koford 
and  Groner   [54]   have  shown  that  with  a  proper  value  of  vector  _b,  the 
LMSE   discriminant   function   a^^  is   directly   related   to  Fisher's  linear 
discriminant  which  will  be  discussed  in  the  next  section.     Patterson  and 
Womack   [55]    showed  that   the  LMSE  solution  algo  gave  a  minimum-squared- 
error  approximation  to  the  Bayes'  discriminant.    There  is  a  variation  of 
this    procedure    known    as    Ho-Kashyap    algorithm     [56]     which    gives  a 
separating  vector  for  the  linear  separable  case. 

4.3  Fisher's  Linear  Discriminant 
The  linear  discriminant  function  in  this  section  is  rather 
different  from  those  in  the  previous  section.  Fisher's  linear 
discriminant  does  not  solve  the  classification  problem.  Its  purpose  is 
to  reduce  the  dimensionality  of  the  data  from  p-dimensions  to  one 
dimension  and  still  retain  class  separability.  More  specifically,  we 
want  to  find  a  linear  transformation  that  maps  the  data  from  feature 
space  into  one  dimension  and  this  mapping  should  maximize  some  criterion 
functions.  This  criterion  function  serves  as  the  indication  of  class 
separability. 
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Suppose  that  we  have  a  set  of  n  p-dimensional  samples  X  =  {jc^ ,  , 

_x^}  in  which  samples  come  from  class  to^  and  n2  =  n-n^  samples  come 
from  If  we  form  a  linear   combination   of   the   components   of  x,  we 

obtain  a  scalar 


2  =  (4.19) 


for  each  of   the  n  samples,   i.e.,  Z  =   {z^,  ,   z^}.     Geometrically,  if 

llwfl  =  1,  each  is  the  projection  of  the  corresponding  onto  a  line 
in  the  direction  of  w.  This  is  illustrated  in  Figure  4.3.  Actually, 
the  magnitude  of  w_  is  not  important.  It  is  merely  a  scaling  factor. 
The  important  thing  is  the  direction  of  w.-  Figure  4.3  also  illustrates 
that  in  one  direction,  namely  _w,  we  obtain  a  good  separation  of  the 
classes,  while  in  another  direction,  _w ' ,  the  data  is  completely  mixed. 

To  obtain  a  good  separation,  we  would  like  the  projected  points 
from  the  same  class  to  cluster  together,  but  the  two  clusters  should  be 
separated.  Thus  we  would  like  the  projected  points  to  have  a  large 
difference  of  the  sample  means  and  have  small  within  group  variances. 
If  m^  is  the  p-dimensional  sample  mean  given  by 

-2i  =  ^      ^         5  (4.20) 

i    xe  X. 
-  1 

then  the  sample  mean  for  the  projected  points  is  given  by 


m. 


i    ze  z. 

1 
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Figure  4.3      Projection  of  samples  onto  a  line. 
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=  ^     I  '  ' 


^  w  X    =  w  m  (4.21) 

i  xe 

It  follows  that  the  difference  of  the  sample  means  of  the  projected 
points  is  given  by 


Define  the  scatter  of  the  projected  samples  for  class  o)^  as 

=     I        (z  -  m  )2,  1  =  1,2  (4.23) 
ze 

Thus,  an  estimate  of  the  variance  of  the  pooled  data  is  (1/n) 
(S^  +  S|).  Then,  rather  than  using  the  variance  of  the  pooled  data, 
we  can  use  (S^  +  s|)  which  is  known  as  the  within-class  scatter  of  the 
projected  samples.  The  Fisher  linear  discriminant  [57]  is  then  defined 
as  the  linear  function  w'2L         which  the  criterion  function 

-  2 
|m    -  m  I 

J(w)=-z  =   (4.24) 

is  maximized. 

To   obtain  J  as   an  explicit   function  of  w,   we   can  rewrite  in 

—  i 

(4.23)  as 
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X  e  X. 
—  1 

S?  =  w'^S.w  (4.25) 
1—1— 


where 


Si  =    I       (iL  -  Si)(2L  -  ii)*^  (4.26) 

x£  X. 
1 

is  called  the  scatter  matrix.    Then  the  within-class  scatter  equals  to 


+  S|  =  w*^(S^  +  S2)w 


where 


=  w^S  w  (4.27) 


is  called  the  within-class  scatter  matrix.  The  within-class  scatter 
matrix  is  proportional  to  the  sample  covariance  matrix  for  the  pooled 
p-dimensional  data,  is  symmetric  and  positive  semidef inite,  and  is 
usually  nonsingular  if  n  >  p.     Similarly,   [m^^  -  m^  |     can  be  written  as 

i"        "  ^2      /  t  t  .2 

yva^  -  sil^)    =  (w  fflj^  -  w  m2) 

=  w^(m^  -  5.2^^^!"  3.2)^  (4.29) 


=  /SgW 


where 


Sg  =  (mi  -  m2)(m^  -  m2)*^  (4.30) 
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Is  called  the  between-class  scatter  matrix,  which  is  also  synmetric  and 

positive  semidef inite,  but  because  (m^  -  m^)  is  a  vector,  i.e.  pxl 
matrix,  its  rank  is  at  most  one  [58] . 

In  terms  of  Sg  and  S^,  the  criterion  function  J  can  be  written  as 

w*^S^ 

J(w)  =  —   .  (4.31) 

w  S  w 
—  w— 

It  can  be  shown  that  the _w  that  maximizes  J  in  (4.31)  is  [1A,46] 


w  =         (m^  -  m^)  (4.32) 


which  is  Fisher's  linear  discriminant,  i.e.,  the  linear  function  which 
maximizes  the  ratio  of  the  between-class  scatter  to  within-class 
scatter.  Thus,  the  p-dimensional  problem  has  been  converted  to  a  one- 
dimensional  problem.  It  will  be  seen  in  the  next  chapter  that  Fisher's 
linear  discriminant  can  be  used  to  reduce  the  dimensionality  of  the 
data,  via  feature  extraction  and  feature  selection.  Actually,  Fisher's 
linear  discriminant  can  be  extended  to  convert  the  data  from  p- 
dimensions  to  d-dimensions  for  p  >  d  >  1  [59].  Note  that  this  mapping 
can  not  theoretically  reduce  the  minimum  achievable  error  rate.  In 
general,  one  has  to  sacrifice  some  of  the  theoretically  attainable 
performance  for  the  advantage  of  being  able  to  work  in  a  lower 
dimension.  However,       if      the      conditional      density  functions 

p(x|<*'^)  (i=l,2)  are  multivariate  normal  with  equal  covariance  matrices, 
one  need  not  sacrifice  performance.  Recall  from  Chapter  2  that  for  this 
case  the  discriminant  function  is  (Eq.  (2.31)  in  Chapter  2) 
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D(x)  =  w'^x  +  w  (4.33) 


where 


-1 

w       =    I  -  (4.34) 

If  the  estimates  of  the  sample  means  and  the  sample  covariance  matrix, 
and  I  are  used,  the  _w's  in  (4.34)  and  (4.32)  are  the  same.  Thus 
the  samples  are  mapped  into  the  same  direction  except  for  Fisher's 
linear  discriminant.  The  threshold  weight  w^  must  be  estimated 
separately. 

It  is  also  interesting  to  compare  (4.19)  and  (4.1).  These  two 
equations  are  almost  the  same  except  (4.1)  has  an  additive  term,  the 
threshold  weight  w^.  Thus  the  linear  discriminant  function  in  (4.1)  can 
be  interpreted  in  a  manner  similar  to  that  above,  i.e.,  each  point  x_  is 
projected  onto  a  line  in  the  _w  direction,  then  the  origin  of  the  new 
axis  is  translated  to  the  right  (in  the  positive  direction)  by  w^. 
Every  point  falling  on  the  positive  side  of  this  new  origin  is  decided 
as  coming  from  class  oij^ ,  otherwise  0^2'  By  the  same  token  Fisher's 
linear  discriminant  can  be  interpreted  as  a  method  for  finding  the 
hyperplane  which  partitions  the  feature  space.  However  Fisher's  linear 
discriminant     only     gives     the     direction     (orientation)     w     of  the 
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hyperplane.  The  discriminant  cannot  determine  the  position,  i.e.  the 
threshold  weight.  This  is  why  we  stated  earlier  that  Fisher's  linear 
discriminant  does  not  solve  the  classification  problem. 


CHAPTER  5 
FEATURE  SELECTION  AND  EXTRACTION 

So  far,  we  have  only  dealt  with  the  problem  of  designing  a 
classifier.  In  this  chapter  we  discuss  feature  selection  and 
extraction,  whose  purpose  is  to  reduce  the  number  of  features  required 
to  represent  the  data  while  retaining  the  class  discrimination 
information.  The  main  motivation  for  reducing  the  number  of  features  is 
to  simplify  the  classifier  and  to  avoid  the  dimensionality  problem.  The 
dimensionality  problem  is  a  phenomenon  recently  observed  by  many  authors 
[14,  17-20].  This  phenomenon  may  be  described  as  follows.  Consider  the 
situation  where  only  a  finite  number  of  training  samples  for  two  classes 
of  data  are  available.  The  performance  of  a  classifier  for  this 
training  set  is  not  necessarily  improved  as  one  increases  the  number  of 
features  representing  the  data.  In  fact,  the  classifier  performance  may 
even  deteriorate  as  we  add  features  to  our  feature  set.  This  will  be 
discussed  more  fully  in  Chapter  7. 

There  are  two  feature  reduction  techniques.  One  is  feature 
selection  in  the  measurement  space,  known  as  feature  selection.  The 
other  is  feature  selection  in  the  transform  space  and  is  known  as 
feature  extraction.  Feature  selection  is  simply  an  algorithm  for 
choosing  a  subset  of  p  measurements  (features)  that  maximize  some 
criterion  functi  on.  Thus  an  optimal  selection  can  be  achieved  only  by 
testing   all   possible   sets    of    p    features    chosen   from   P  measurements, 
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P 

i.e.,  by  applying  the  criterion  (P)  =  p!/(p!(p  -  p)!)  times  .  The 
difficulty  with  this  method  is  that  the  number  of  calculations  required 
is  enormous  even  for  a  modest  number  of  features.  In  practice,  some 
suboptimal  procedures,  such  as  the  search  algorithm  known  as  branch  and 
bound,  or  heuristic  approach  is  used.  Some  of  these  methods  can  be 
found  in  [47,60].  In  Chapter  7  we  suggest  an  algorithm  for  solving  this 
problem. 

For  feature  extraction,  the  problem  is  to  find  a  p  x  P  matrix  W 
such  that  the  derived  features  ^  =  Wx_  maximize  some  criterion.  All 
feature  extraction  algorithms  can  be  classified  into  two  categories. 
The  first  category  solve  the  matrix  W  directly.  This  method  is  similar 
to  Fisher's  linear  discriminant  method  described  in  Chapter  4.  For  the 
second  category  the  data  is  transformed  to  another  domain,  for  instance 
by  the  Fourier  transfonn,  the  Hadamand  transform,  the  Karhunen-Loeve 
expansion,  etc.  Then  features  are  selected  from  the  transformed 
domain.  In  this  chapter,  we  discuss  several  of  these  methods.  In 
Section  5.2,  we  discuss  a  method  suggested  first  by  Foley  and  Sammon 
[59].  This  method  is  an  extension  of  Fisher's  linear  discriminant  and 
falls  in  the  first  category.  We  then  describe  in  Section  5.3  a  feature 
extraction  algorithm  belonging  to  the  second  category.  This  method  is 
based  on  Karhunen-Lo^ve  expansion.  Since  all  feature  extraction 
algorithms  require  a  feature  selection  criterion,  we  will  first  discuss 
this  topic  in  Section  5.1. 
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5.1  Feature  Selection  Criteria 
The  ultimate  goal  of  a  pattern  recognition  system  is  to  properly 
classify  the  data  with  as  low  a  classification  error  rate  as  possible. 
Thus  an  ideal  criterion  for  feature  selection  is  to  minimize  the  error 
rate.  Unfortunately,  the  error  rate  is  difficult  to  evaluate.  To 
overcome  this  theoretical  problem  other  criteria  are  chosen  which  vary 
monotonically  with  the  error  rate.  The  functions  are  selected  to  be 
easily  calculated.  Generally,  these  criteria  should  possess  the 
following  properties. 

(1)  Should  be  monotonically  increasing  or  decreasing  with  the 
probability  or  error,  or  with  the  bound  (upper  or  lower)  on 
probability  of  error. 

(2)  Should  be  invariant  under  one-to-one  mapping.  This  property  is 
important  since  the  probability  of  error  is  invariant  under  any 
transformation  which  holds  one-to-one  correspondence. 

(3)  All  the  selected  features  should  be  uncorrelated. 

(4)  Satisfy  the  metric  properties,  i.e. 

(i)  c(  0)^,(0.  :  Y^)  >  0    for  ±  ^  2 

(ii)  c(  0)^,03^  :  Y-^)  =  0 

(iii)  c(a)^,a)j  :  Y^^)  =  c(cD^,a)^  :  Y^^) 

(iv)  c(a)^,a)j   :  Y^^)   <  c((D^,a)^   :  Y^^^) 

where  Y^  denotes  a  subset   of   1   features,    o)^  denotes  class  o)^ 
and  c(  )  denotes  a  criterion. 
Note    that    all    the    above    properties    are    not    necessary    but  are 
desirable.     So  far  for  the  general  case,  none     of  the  existing  criteria 
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satisfy  all  of  the  above  conditions.  Some  of  the  criteria  that  have 
been  suggested  are  divergence  [61],  Bayesian  distance  [62],  Matusita 
distance  [63]  ,  Bhattacharyya  [63]  ,  etc.  All  of  these  measures  are  well 
documented  in  [47,64],  so  we  will  not  discuss  them  here.  The  only 
criterion  that  we  will  be  using  in  the  rest  of  this  chapter  is  Fisher's 
ratio. 


For  this  type  of  feature  extraction,  the  feature  vectors  are  a 
linear  combination  of  the  measurement  vectors  which  minimize  or  maximize 
a  criterion  and  the  features  are  mutually  orthogonal.  In  other  words, 
let  J  be  a  criterion  and  x_  be  a  P-dimensional  measurement  vector.  We 
wish  to  find  a  p-dimensional  vector  ^  (p  <  P)  such  that 


5.2    Feature  Extraction  Based  on  Discriminant  Analysis 


Wx 


(5.1) 


where 


W 


an  p  X  P  matrix  such  that  \j 


w.  =  0  if  1*2 


and  ji_  maximizes   the   criterion  J.     An  example  of   this   type  of  features 


extractor    is    the    Foley-Sammon    method  [59]. 


Foley-Sammon  suggest 


Fisher' 


s  ratio  as  a  criterion,  i.e 


J  = 


(5.2) 
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where 

V/         =      P-dimensional    column    vector    onto    which    the    data  are 

projected 
^t       =  transpose  of  w^ 


-J 


(xfi"\  .x^'''^  ith  sample  vector  for  class  (u. 

jl  JP  1 

n.        =      the  number  of  sample  in  class  o). 
1  In. 

jj       =      estimated  mean  of  class  co. ,  u    =  —    Y  x 
_A       =      difference  in  the  estimated  means,  ^  =  _M ^  ~  ^2 
=      within-class  scatter  for  class  00^, 

j=i  ^  ^ 

=      sum  of  the  within-class  scatter,        =  Sj^  + 
From  (4.32)  in  Section  4.3, 


w,  =  a.S~}  A  (5.3) 
— 1  w  — 


where  Wj  is  the  direction  of  the  best  feature  and  a^^  is  chosen  such  that 

t  , 
Wj^  Wj  =  1. 

The  other  w. 's  for  i  >  1  are  subjected  to  the  following  contraints, 
—a 

w^  maximize  the  criterion  J  and  w^  is  orthogonal  to  w_.  for  j  <  i. 
Foley  and  Sammon  [59]   give  a  sequential  algorithm  to  achieve  this. 


5.3    Feature  Extraction  Based  on  the  Karhunen-Loeve  Expansion 
Before  describing  how  the  Karhunen-Loeve  expansion  may  be  utilized 
for  feature  extraction,   let  us  consider  the  expansion  itself.     Only  the 
discrete-time  case  will  be  considered  here.     For  the  continuous  case  see 
[65]. 
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Consider    an    ensemble    of    P-dimensional    random    vectors      x  and 

suppose   that   any  sample  vector  x_  =    [xpX2,  >Xp]^   from  this  ensemble 

belongs  to  one  of  the  m  possible  pattern  classes  ii)^,i=l,  ,m,  where  the 

probability  of  occurrence  of  the  ith  class  is  P(oj^).  Let  us  assume  also 
that  each  class  has  been  centralized  by  subtracting  the  means,  of  the 
random  vector  _x^  in  that  class.  Thus,  if  the  centralized  observation 
from  the  class  oj^  is  denoted  by  z^  we  can  write. 


z .  =  x  -  y. 
— 1     —  -1 


Suppose    now    that    we    wish    to    represent  z^  by    a    special  finite 
expansion  of  the  form, 

k=l 

where  the  u^^  are  orthonormal  deterministic  vectors  satisfying  the 
condition 


u%*T=  6,1  (5.5) 
—  kr^  1  kl 


while  the  v^^^  are  zero  mean  and  mutually  uncorrelated  random 
coefficients  for  which 


X  '^^i^  '^vii>  =  i  \i  (5.6) 
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with    the    *    in   both    (5.5)    and    (5.6)    denoting    the    complex  conjugate, 

and  6,  ,  represents  the  Kronecker  delta  function,  i.e. 
kl 


X     _  /I     ,    k  =  1 
\l  "  ^0     ,    k  *  1  ' 


The  deterministic  vectors  u^  in  equation  (5.4)  are  called  the 
Karhunen-Lo^ve  coordinate  vectors.  Chien  and  Fu  [66]  have  shovm  that 
these  vectors  are  the  eigenvectors  of  the  covariance  matrix  C  of  z^, 
where 

m  ^ 

C  =     I    P(a)jE{z^z'^.   }  (5.7) 
1  — i —  1 

1=1 

2 

and  that  ^  =  Pj^         their  associated  eigenvalues. 

The  importance  of  the  Karhunen-Lofeve  (K-L)  coordinate  vectors  in 
signal  processing  becomes  apparent  if  they  are  arranged  in  a  descending 
order,  according  to  the  magnitude  of  their  corresponding 
eigenvalues  X^,  i.e., 

X,   >   >  X    >  >  X      .  (5.8) 

12  p  P 


For  then,  if  an  approximation  of  ^  of  ^  is  constructed  of  the  form 


k=l 

where  p  <  P,  it  can  be  shown  [66]   that  both  the  mean  square  error  e  , 
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-2 
e 


=     I    P(a).)  E{|z.  -  z.  p}  (5.10) 
i=l 


and  the  entropy  function  Hp,  where 

H    =     I    p^log  (5.11) 
k.=p 


are    minimized    with    respect    to    the    ordering    of    the  eigenvectors. 

Moreover,      Tou     and     Heydorn      [67]      pointed     out     that     the  total 
p 

V      2  2 

entropy  H  =  -  2,    Pi,  log  Pi,  associated     with      the      coordinated  system 
k=l     ^  ^ 

u^,k.=  l,  ,p,   obtained  in  this  manner  is  minimized  with  respect   to  any 

other  coordinate  system. 

In  order  words,  by  ordering  the  eigenvectors  in  correspondence 
with  the  decreasing  magnitude  of  their  associated  eigenvalues,  we  obtain 
the  optimal  reduced  K-L  coordinate  system  in  which  the  first  p 
coordinate  coefficients,  Vj^^,  defined  as 


v,  ,  =  z3  u,  (5.12) 
ki     ^  -^k 


contain  most  of  the  "information  for  reconstruction"  of  the  random 
vector  x^. 

Finally,  it  should  be  noted  that  if  the  transformation  in  equation 
(5.12)  is  applied  to  the  original  pattern  vector  k_  with  means  included, 
then  we    obtain   a   new   feature   vector  This    transformation   can  be 

written  in  vector-matrix  terms  as 
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y    =iEU  (5.13) 


where  U  is  a  matrix  composed  of  the  complete  set  of  ordered 
eigenvectors,  i.e., 

U  =  IEi.H.2»  ."J       •  (5.14) 


The  eigenvalues  of  the  new  covariance  matrix  are  then  the 
variances        of  these  new  features,  i.e. 


2  " 

\  =  \  =         P('^i)var(y^^)  (5.15) 


where 


varCy^^.)  =  E{(y^.  -  Vj^.)^}  (5.16) 


and  Vj^^  is  the  transformed  mean  of  the  feature  y,  .  ,  i.e.. 


\i  =  ^  •  (5.17) 

A    number    of    feature    extraction    techniques    based    on    the  K-L 

expansion  have  been  suggested  in  recent  years.     These  techniques  can  be 

classified  into  six  major  categories. 

(1)     In  the  first  approach  suggested  by  Chien  and  Fu   [66],   the  K-L 

expansion    is    based    on    coordinate    vectors  u    =  fu      u   ;i  1 

^  lk'"2k'  '"kk^ 


61 


which  are  chosen  as  the  eigenvectors  of  the  covariance  matrix  C, 
where 


m 


C=     I    P(a))  E{[x    -ii.][x.  -  V.]^}     .  (5.18) 
i=l        i         -1  -a 


As  we  have  seen  previously,  if  the  eigenvectors  are  ordered 
according  to  the  magnitude  of  their  associated  eigenvalues  X^,  then 
the  mean  square  error  e  and  the  Hp  as  defined  in  (5.10)  and  (5.11), 
respectively,  are  both  minimized  by  taking  the  first  p  features  (p  < 
P)  of  the  transformed  feature  vector  j_,  where 


1    =  ]i       ^      '  (5.19) 


(2)  The  second  approach  was  suggested  by  Tou  and  Heydorn  [67]. 
Here  the  transformation  matrix  Up(Pxp)  is  formed  by  the  first  p 
eigenvectors  of  the  same  covarience  matrix  C  as  in  (5.18)  above. 
But  their  choice  of  the  eigenvectors  is  based  on  the  smallest 
eigenvalue  of  C,  i.e. 


A,   <  X    <  <  X    <  X        .  (5.20) 

1        z  p  p 

P  2  2 

Tou    and    Heydorn    [67]    show    that    the    entropy  H    =  -  I  P,    log  pf" 

^        k=l  ^  ^ 

associated   with    the   p   features    formed    in   this   way    is  minimized. 

This  technique  provides  a  very  good  description  of  each  individual 
class  but  there  is  no  guarantee  that  the  selected  features  will  have 
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any  discriminatory  power  between  several  classes. 

(3)  A  third  approach,  due  to  Watanabe  et  al.  [68],  is  called  Clafic 
uses  a  K-L  expansion  where  the  coordinate  system  is  optimal  with 
respect  to  one  class  (i)^  alone.  Here,  the  covariance  matrix  is 
given  by 

=  E[^]    xeui^    .  (5.21) 

In    this    case, the    means    of    the    representation    vector  jc    are  not 
subtracted.        If     the     transformation    matrix    U     is     composed  of 
eigenvectors    assoicated   with    the    eigenvalues        of    the  matrix 
placed   in   descending   order,    then  the   information  contained   in  the 
transformed  feature  vector        where      is  defined  by 

X*"  =       U  xeu)^ 

is  concentrated  on  the  first  few  coordinates.  But  when  input 
patterns  x  d  (u^,  the  information  in  the  corresponding  feature  vector 
^  will  be  spread  over  all  the  components  y^^.  In  this  way  a  measure 
of  the  dispersion  of  information  can  be  used  for  classification 
purposes . 

(4)  The  fourth  method  is  also  due  to  Watanabe  et  al.  [68]  ,  and  is 
called  Selfic.  In  this  method,  the  covariance  matrix  C  is  computed 
from  the  input  patterns  x_  centralized  over  all  classes  by 
subtracting  the  mean  vector  y  defined  as 
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y  =  E  [  x]  V  X 


(5.22) 


In  other  words,   the  covariance  matrix  C  is  defined  here  as 


C  =  E[(x-  _u)(x  -  v)^]  yx 


(5.23) 


Thus,   the  eigenvalues  of  C  are  functions   of  both  the  variances  and 


descending  order  according  to  their  corresponding  eigenvalues.  It 
is  suggested  that  those  features  y^^  with  small  eigenvalues  will 
contain  most  of  the  information  about  the  membership  of  the  pattern 
x_  in  the  appropriate  class. 

(5)  Fukunaga  and  Koontz  have  suggested  their  transform  [69].  The 
patterns  _x  are  first  transformed  by  a  matrix  Q  such  that  the 
covariance  matrix  C  of  the  resulting  vectors  is  a  unit  matrix,  i.e. 


Application  of  the  K-L  expansion  to  each  class  separately 
corresponds    to    finding    the    eigenvalues    and    eigenvectors    of  the 


the    means    of    each  class. 


The    eigenvectors    are    arranged  in 


(5.24) 


matrices  C^,i=l,2,  where  C. 


i  is  defined  as 


^i  "  P(<^i)E[Q'^2Dc'^Q] 


xeo). 
1 


(5.25) 
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In  this  way, the  sum  C  of  these  matrices        satisfies  the  condition 
C  =       +       =  I 

Fukunaga  and  Koontz  [69]  have  shown  that  the  two  classes  have  the 
following  properties: 

(i)  the  two  systems  of  eigenvectors  are  identical 

(ii)  the    important    features    are    never    shared    (the  largest 
eigenvalue  of  one  class  is  the  smallest  of  the  other). 

As  a  result,  the  new  basis  vectors  are  formed  by  eigenvectors 
corresponding  to  the  largest  eigenvalues  of  each  class. 
(6)  The  last  approach  has  been  suggested  by  Roucos  and  Childers 
[28,70].  This  method  is  similar  to  the  fourth  method,  but  instead 
of  ordering  the  features  by  their  eigenvalues,  they  are  ordered 
according  to  Fisher's  ratio. 


CHAPTER  6 
PERFORMANCE  ESTIMATION 


In  the  previous  chapters  we  discussed  the  design  of  a  classifier. 
After  the  classifier  is  designed,  one  needs  to  evaluate  its  performance 
to  compare  the  design  with  competing  designs.  The  error  rate  is  the 
performance  measure  that  will  be  discussed  here. 

The  estimation  of  the  error  rate  for  a  classifier  has  received 
considerable  attention  in  the  literature.  An  extensive  bibliography  has 
been  published  by  Toussaint  [71].  There  are  many  approaches  to  the 
problem.     In  this  chapter  we  discuss  only  a  few  of  these  approaches. 

6.1    Empirical  Approach 
This    approach    counts     the    number    of    errors    when    testing  the 

classifier   on  a   test   data  set,   which  can  be   chosen  in  numerous  ways. 

Here  we  describe  four  popular  procedures. 

(1)  The  Resubstitution  Estimate.  In  this  procedure  the  same  data 
set  is  used  for  both  designing  and  testing  the  classifier.  Many 
authors  [72-74]  have  shown  both  experimentally  and  theoretically 
that  this  procedure  gives  a  very  optimistic  estimate,  especially 
when  the  data  set  is  small.  Note,  however,  that  when  a  large  data 
set  is  available,  this  method  is  as  good  as  any  procedure. 
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(2)  The  Holdout  Estimate.  The  data  is  partitioned  into  two 
mutually  exclusive  subsets  in  this  procedure.  One  set  is  used  for 
designing  the  classifier  and  the  other  for  testing.  This  procedure 
makes  poor  use  of  the  data  since  a  classifier  designed  on  the  entire 
data  set  will,  on  the  average,  perform  better  than  a  classifier 
designed  on  only  a  portion  of  the  data  set.  This  procedure  is  known 
to  give  a  very  pessimistic  error  estimate. 

(3)  The  Leave-One-Out  Estimate.  This  procedure  assumes  that  there 
are  n  data  samples  available.  Remove  one  sample  from  the  data 
set.  Design  the  classifier  with  the  remaining  (n-1)  data  samples 
and  then  test  it  with  the  removed  data  sample.  Return  the  sample 
removed  earlier  to  the  data  set.  Then  repeat  the  above  steps, 
removing  a  different  sample  each  time,  n  times  until  every  sample 
has  been  used  for  testing.  The  total  number  of  errors  is  the  leave- 
one-out  error  estimate.  Clearly  this  method  uses  the  data  very 
effectively.  This  method  is  sometimes  referred  to  as  the  Jack  Knife 
method. 

The  leave-one-out  estimate  was  shown  experimentally  by 
Lachenbruch  and  Mickey  [73]  to  be  approximately  unbiased,  a  property 
independent  of  the  type  of  the  classifier  or  the  underlying 
distribution  of  the  data.  However,  there  are  at  least  two 
disadvantages  of  the  method.  Firstly,  the  unbiased  property  is 
achieved  at  the  expense  of  an  increase  in  the  variance  of  the 
estimator.  Secondly,  it  is  clear  that  this  procedure  requires  a 
great  deal  of  computation.     However,   Fukunaga  and  Kessel   [75]  have 
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shovm  that  for  the  multivariate  Gaussian  case  the  leave-one-out 
procedure  requires  little  extra  computation. 

(4)  The  Rotation  Estimate.  In  this  procedure  the  data  set  is 
partitioned  into  n/d  disjoint  subsets,  where  d  is  a  divisor  of  n. 
Then,  remove  one  subset  from  the  design  set,  design  the  classifier 
with  the  remaining  data  and  test  it  on  the  removed  subset,  not  used 
in  the  design.  Repeat  the  operation  for  n/d  times  until  every 
subset  is  used  for  testing.  The  rotation  estimate  is  the  average 
frequency  of  misclassif ication  over  the  n/d  test  sessions. 

When  d  =  1  the  rotation  method  reduces  to  the  leave-one-out 
method.  When  d  =  n/2  it  reduces  to  the  holdout  method  where  the 
roles  of  the  design  and  test  sets  are  interchanged.  The 
interchanging  of  design  and  test  sets  is  known  in  statistics  as 
cross-validation  in  both  directions.  As  we  may  expect,  the 
properties  of  the  rotation  estimate  will  fall  somewhere  between  the 
leave-one-out  method  and  holdout  method.  The  rotation  estimate  will 
be  less  biased  than  the  holdout  method  and  the  variance  is  less  than 
the  leave-one-out  method. 

6.2  Parametric  Approach 
The  estimates  in  the  last  section  do  not  assume  a  form  for  the 
distribution  function  or  the  type  of  the  classifer.  In  this  section,  we 
will  assume  that  the  form  of  the  distribution  function  is  known  but  the 
parameters  are  not.  These  parameters  have  to  be  estimated  from  the 
available  training  samples.     There  are  a  number  of  possible  error  rates 
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which  may  be  defined,  and  each  one  may  be  valuable  depending  on  the 
circumstances. 

Let  E(Aj^  ,A2)  be  the  error  rate  which  depends  upon  the 
classification  regions,  k^,  as  defined  in  Section  2.3,  and  the  presumed 
distribution,  k^,  of  the  observations  that  will  be  classified.  The 
arguments  k-^  and  A2  are  written  explicitly  to  emphasize  the  fact  that 
the  error  rate  E  depends  on  both  arguments,  i.e.,  the  classification 
regions  and  the  presumed  distribution  of  the  observations  that  will  be 
classified.  Note  that  specifying  the  classification  regions  is 
equivalent  to  specifying  a  specific  classifier.  Let  f  denote  the 
presumed  distribution  of  the  observations  to  be  classified.  R  denotes 
the  classification  regions  when  the  parameters  are  known,  i.e.,  f  is 
known.  In  other  words,  R  is  the  classification  regions  of  the 
classifier  which  is  designed  with  a  complete  knowledge  of  the  underlying 
distributions,  f.  Let  f  denote  the  presumed  distribution  of  the 
observations  with  its  parameters  estimated  from  N  training  samples  in 
which  n^  samples  come  from  oj^  and  n2  samples  from  (Uj.  Let  R  be  the 
classification  regions  when  the  parameters  are  unknown,  i.e.,  estimated 
from  N  training  samples.  In  a  sense  f  and  R  are  the  estimates  of  f  and 
R,  respectively.  To  avoid  confusion  between  the  expectation  notation  E, 
and  error  rate  E,  let  E[  ]  denote  the  expectation  operator  with  square 
brackets.    The  error  rates  of  interest  are  as  follows. 

(1)  The  optimum  error  rate.  The  optimum  error  rate  is  the  error 
rate  when  all  parameters  are  known.  In  our  notation,  the 
optimum  error   rate   is   represented   by  E(R,f).      For   the  Bayes 
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classifier,  the  optimum  error  rate  is  the  Bayes  error  rate  E* 
defined  in  (2.18)  and  (2.19)  of  Chapter  2.  As  an  example,  let 
us  consider  a  two-class  case  multivariate  normal  distribution 
with  equal  covariance  matrix  and  a  priori  probability.  From 
Section  2.4,  the  decision  rule  for  this  case  is 


aj*(x)  =  (D^     if  D(x)  >  0 


=  (i)^  otherwise 


(6.1) 


where 


D(x)  =  (x  -  I  (p^  +  u^))^     I  -  u^)       .  (6.2) 

Thus  the  corresponding  optimum  classification  regions  are 
=  {x|D(x)  >  0} 


(6.3) 


R2  otherwise 


Then  the  optimum  error  rate  is  given  by 


E(R,f)=|    /    p(x|oo  )dx+|    /    p(x|a)  )dx 


-|pr(D(x)  >  Oloj^)  +1  Pr(D(x)  <  0 1     )     .  (6.4) 


When  all  parameters  are  known  the  statistic  of  D  is  normally 
distributed  with  parameters 
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E[D(x)|u.]  =  ^ 


i+1 


(6.5) 


Var(D(x))    =  E[D(x)  -  D(ji^)]^ 


=  (Jil  -  Ji2>*'  I  ~^  (Jii  -  V^)  (6.6) 
.2 


2 

The  quantity  5  is  the  Mahalanobis  distance  with  known 
parameters.    Then  the  optimum  error  rate  is 


E(R,f)  =  <(.(-  |)  (6.7) 


where  <t>(z)  is  the  standard  normal  distribution  function  defined 
as 


,2  1     9  ' 

Kz)  =  2^    !      exp(-  Y  u^)  du     .  (6.8) 


The  actual  error  rate.  The  actual  error  rate  is  the  error  rate 
of  a  classifier  using  estimated  parameters.  We  denote  the 
actual  rate  by  E(R,f).     For  a  two-class  case  it  is  given  by 

E(R,f)  =  P2  /    p(xh2^'^-      ^1  •''    P(2ihi)dx      .  (6.9) 
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Consider  a  two-class  case  multivariate  normal  distribution 
with  equal  covariance  matrix  and  a  priori  probability.  The 
decision  rule  is  given  by 


co(x)  =  0),   if  D  (x)  >  0 
—         1  s  — 


=  (i)^  otherwise 


(6.10) 


where 


DgCx)  =  [x  -  7         +  m2)]'^s"^m^  -  m^)  (6.11) 


where  and  S  are  the  sample  mean  of  and  the  pooled 
unbiased  estimate  of  the  covariance  matrix,  respectively  i.e., 


n . 

=  J-    I      x^^^  (6.12) 
-X      n.  —J 

1      2      1=1  j=l 

In  the  above  expression  2ij^^  refers  to  the  jth  training  sample 
from  class  co^.    The  corresponding  classification  regions  are 

R,  =  {xId  (x)  >  0} 
1        — '  s  — 


(6.14) 


R2  otherwise 


The  actual  error  rate  for  this  case  is 
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*  1  1 

E(R,f)  =  -J  Pr(xERj  oj^)  +  Y  PrCxeR^lo)^) 

=  -jPr[Dg(x)  >  Oloj^]  +YPr[Dg(x)  <  0|oo^] 


(6.15) 


D  (x)  is     a     conditioned     normal     distribution     with  known 
s  — 

^^,^2,  and  S.  The  means  and  variances  are  given  by 


E[D^(x)|  a)_j^]  =  [JLl  -  "I  (m^  +  ^2^^^^  ^  ~ 


(6.16) 


=  D  (y.) 
s  —1 


Var  D^(x)  =  (m^  -  m^)^S  ^  I  S  \m^-  m^) 


(6.17) 


The  probability  that  x  falls  in         if      belongs  to  0^2,  is 

D  (x)  -  D  ( u  )  D«(iio) 
Pr(D  (x)  >  0|  oj.)  =  Pr[-^  >  -    ^  ] 


°s^^2^ 


(6.18) 


Similarly 


°s^^l^ 

Pr(D  (x)  <  0|u)  )     =  — — —]     .  (6.19) 


Thus 


1  °s^-^P         1  °s^-^2^ 

E(R,f)  =  j-  ^        ]   +  J  H  ]     .  (6.20) 


73 


The  expected  error  rate.  The  expected  error  rate  is  the 
expected  value  of  the  actual  error  rate  over  all  possible 
samples   of    size  and   n2,   which   is   sometimes  known  as  the 

expected  actual  error  rate.  In  other  words,  the  expected  error 
rate  is  the  average  of  the  actual  error  rate  when  n^^  and  n2 
training  samples  from  ui^  and  0^2,  respectively,  are  available. 
For  a  two-class  case  multivariate  normal  distribution  with 
equal  covariance  matrix,  the  expected  error  rate  is 

E[E(R,f)]  =        •  Pr[D^(x)  >  Oloj^]  +        •  Pr[D(x)  <  0 1  o)^  ]  . 

(6.21) 

The   distribution   of  is   very   complex.      Many  statisticians 

have  studied  this  problem  [76-91].    John  [76-78]  gave  the  exact 

distribution  and  the  associated  expected  error  rate  of  D  when 

s 

the  covariance  matrix  is  known.  ^  The  unknown  covariance  matrix 
case  has  been  studied  by  Sitgreaves  [79],  Approximations  have 
been  given  by  various  authors  [82-91]. 

Lachenbruch  [92]  suggested  an  estimate  for  the  expected 
actual  error  rate  based  on  the  expected  mean  and  variance 
of  D^Cjc).  The  expected  means  and  variance  for  the  sample  size 
nj^  and  n2  for  class  1  and  2  are 

1               i+^   9     P(n„-n  ) 
E[D  (x)|a).]  =  j  C,  [(-D'-^^a^  /     '  ]   i=l,2  (6.22) 
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where 


n  +n  -2 

C.  J;        ,  (6.23) 

1  n^+n^-p-a 


and 


p(n  +n  ) 

Var[D  (x)]  =  C,(6^  +      „  „    -)  (6.24) 
s  —  2  12 


where 

2 

(n^+n2-3)(n^+n2-2) 
^2  ~  (nj^+n2-p-2)(n^+n2-p-3)(nj+n2-p-5)     '  (6.25) 

Although  Dg(x)  is  not  normally  distributed,  for  nj^  and  nj 
sufficiently  large;  the  distribution  is  very  close  to  normal. 
Thus  the  expected  actual  error  rate  E  may  be  estimated  by 


E[E(R,f)]  =  P^(l)(-Yp  +  P2KY2)  (6.26) 


where 


E[D  (x)|a)  ] 

Y.  =  ^    ,     i=l,2      .  (6.27) 

[Var(D  (x))]^/^ 


8  — 


The  plug-in  estimate  of  the  error  rate.  The  plug-in  error  rate 
is  obtained  by  using  the  estimated  parameters,  f,  for  f. 
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E(R,f)  =        /      P(x|"i)dx  +  P2  I      P(xh2)dx      .  (6.28) 
^2  ^1 
For   the    two-class   multivariate   normal   distribution   case  with 
equal    covariance    matrices,  is    normally    distributed  with 

means  and  covariance  equal  to 


*  1  t  -1 

=  2  D 


1  t  -1 

D^Cji^)  =  [m^  -  -jCSLi  +  ™2^^   ^    ^-1  ~  -2^ 


1^2 
=  -  2« 


D 


-1  -2'  "  ^-1  -2' 
2 


where 


(6.29) 


(6.30) 


*  t  -1 

Vjj         =  (m^  -  m^)     S     (m^  -  m^) 

=  (m,  -  mj'^s"^(m,  -  m^)  (6.31) 


=  im^  -  m2)*'S  ^(m^^  -  m^)      .  (6.32) 


2 

D  is  the  Mahalanobis  distance  with  estimated  parameters.  Note 
that    the    plug-in    estimate    is    consistent    and  asymptotically 
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efficient.  However,  it  is  not  unbiased  and  it  may  be  quite 
poor  for  small  sample  sizes. 

6.3  Discussion 

The    reader   may   have    noted    that    all    the   estimates    in   the  last 

section,     except    the    plug-in    error    rate,     required    that    the  true 

distributions   be  knovm,    which   in  general   are   not  known.      The  plug-in 

error  rate  is  simple  to  calculate  but  is  noted  for  being  severely  biased 

unless    a   large    training   sample    set    is   available.      For   the  two-class 

multivariate   normal   distribution   case  with   equal   covariance  matrices, 

there   are    some   approximations    for   the   expected   error   rate  available. 

One    method    is    to    approximate    the    distribution    of        by    a  series 

expansion.      Then  replace   the   true   parameters   by   the   estimated  values, 

2  2 

for  example,  replace  6  by  D  .  Some  of  these  expansions  are  given  by 
Okamoto  [87],  Sitgreaves  [89]  and  Anderson  [90,91].  However,  it  is  not 
known  how  good  these  expansl,ons  are  for  small  samples.     Another  method 

is   given   by  Lachenbruch    [92]    as   discussed   in   the   last   section.  The 

2  2 

approximation  is  achieved  by  replacing  6    with  D    in  (6.22)-(6.27) . 

Lachenbruch    and    Mickey    [73]     compared    a    number    of    methods  for 

estimating   the   error  rate,  which  include  a  plug-in  estimate,   Okamoto 's 

expansion,    and   the   leave-one-out.      They   found   that   Okamoto 's  expansion 
2  2 

which  replaces  6    by  D    is  the  best  where 
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The  second  best   estimate  was   to  use  D     for  6    in  Okamoto's  expansion. 

In  [93],  Lachenbruch  claimed  that  his  method,  which  is  given  in  (6.22)- 

2  2 

(6.27)    where   we    replace  6       by   D  ,    is    comparable    to   using  Okamoto's 
2  2 

expansion  with  D  replacing  6  .  Dunn  [94]  compares  Lachenbruch' s 
estimate  [92]  with  the  results  from  a  Monte  Carlo  study  and  noted  that 
it  tended  to  be  slightly  conservative.  In  no  case  was  the  estimate  0.02 
greater  than  the  Monte  Carlo  estimate  for  the  error  rate. 

Note  that  the  major  portion  of  Section  6.2  has  been  concerned  with 
methods  based  on  the  normal  distribution  with  equal  covariance 
matrices.  If  the  data  is  not  normal  or  not  close  to  normal,  the  methods 
do  not  work  well  and  results  for  non-normal  data  do  not  exist.  Thus, 
other  approaches  like  those  in  Section  6.1  must  be  employed. 


CHAPTER  7 
EFFECTS  OF  FINITE  SAMPLES  ON  THE 
PERFORMANCE  OF  GAUSSIAN  CLASSIFIER 


In  Chapter  2  we  designed  a  classifier  when  complete  knowledge  of 
all  the  relevant  probabilities  was  available.  However,  in  most 
practical  situations  such  complete  knowledge  is  not  available,  instead  a 
set  of  training  samples  together  with  some  vague,  general  knowledge 
about  the  problem  is  available.  In  this  chapter,  we  will  study  how  this 
effects  the  performance  of  the  classifier  for  the  two-class  case  with 
equal  a  priori  probability.  Assume  we  know  that  the  conditional 
probability  densities  are  normally  distributed  for  both  classes  with 
equal  covariance  matrices.  The  only  unknowns  are  the  means  of  each 
class  and  the  common  covariance  matrix.  The  classifier  under 
consideration  is  the  Bayes  classifier  (Eq.  (2.30)  of  Chapter  2)  with  the 
sample  means,  ml^s,  and  sample  covariance  matrix  S  substituted  for  the 
true  values,  where 

(7.1) 


1  V  ( 
"i  j=l  -J 


2  "i 
"1^2-2  i=l    j=l  ^ 
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and  xj"^^  refers  to  the  jth  training  sample  from  class  w^.  We  will  show 
that  when  only  a  finite  number  of  training  samples  is  available,  there 
exists  an  optimal  number  of  features. 


7.1     The  Effects  of  Unequal  Training  Sample  Sizes 
We  showed  in  the  last  chapter  that   the  expected  actual  error  rate 
is  given  by  ((6.26)  of  Chapter  6) 


E[E(R,f)]  =  P^(j.(-Y^)  +  P2*^'^2^  ^^'^^ 


where 


E[D  (x)|(D  ] 

 ^  V7T    .  i  =  (7.4) 

[Var(Dg(x))]^^^ 

E[D(x)h.]     =^.[(-1)^^16^  ^— i-]     ,     i=l,2  (7.5) 

n  +n  -2 

C,  =      J     ^  ,  (7.6) 

1  n^+n^-p-a 

,      p(n  +n  ) 

Var[D  (x)]      =  ZAt  +      „  „      )  (7.7) 
s  z  n^n^ 

(n^+n2-3)(n^+n2-2)^ 
^2  "  (n^+n^_p_2)(n^-Hi2-p-3)(n^+n2-p-5)     *  ^^'^^ 

Rearrange   (7.3)-(7.8)    and   substitute   P,    =  p 

1       ^^2  =   0.5,    then  we   have  the 

following.     (To  simplify  our  notation  from  now  on  we  will  represent  the 

expected  actual  error  rate  E[E(R,f)]  by  P„.) 
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where 


I  <!>[-(«  -  3)]  +  I  <{.  [-(a  +  3)]  (7.9) 


6^ 

a  =  C 


3  p(n    +  n  )  1/2 

(  6"^  +  2_) 

^"2 


pCn^  -  n^) 

n  n  p(n    -  n  ) 

3  =   7  M — T-TTT  = 


^^2 


^    (n^  +       -  p  -  2)(n^  +       -  p  -  5)  1/2 


S      2  ^     (n^  +       -  3)(n^  +       -  p  -  3)  ' 

2 

Note    that    for   a   given   N   =         +  i5     and   p,  is  constant. 

However,  as  (n2  -  n^  >  0,  is  increased,  a  decreases  and  3  increases. 
The  maximum  value  of  a  and  the  minimum  value  of  3  occur  at  the  same 
point,  n2  -  nj^  =  0  or  n2  =  n^  Since  (j)(z)  is  a  monotonically  increasing 
function,  then  for  3  fixed,  P^,  increases  as  a  decreases.  From 
Figure  7.1  it  is  clear  that  as  3  increases,  Pg  increases  for  a  fixed. 
Therefore,  when  (n2  -  n^^)  >  0  is  increased  Pg  increases  for  a  given  N  = 
^1      ^2'  '^^^  minimum  P^  occurs  when  n^^  is  equal  to  n2.  This 

result  agrees  with  [95].  Also  observe  that  when  n2  -  n^^  increases,  P^j^ 
increases  while  Pg2  decreases  but  the  decreasing  Pg2  can  not  offset  the 
Increasing  Pgj  •  This  can  be  interpreted  as  follows.  As  (n2  -  n^^) 
increases  for  a  fixed  N  =  n^^  +  n2,   then  n2  is  becoming  larger  and  n^^  is 
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^[-(a-gi)]  +  ^[-(a+6i)] 

-<)'[-(ct-B)]  -  <))[-(a+e)] 
=  A  -  B  >  0 


a-3i     a-e    a  a+B'  a+Bi 


Figure  7.1 


The  effect  of  changing  3. 
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becoming  smaller.  Therefore  we  know  more  about  class  11^2  and  less  about 
class  ojj^ .  Thus  less  errors  are  made  when  the  observations  come  from  oj^ 
and  more  errors  are  made  when  the  observations  come  from  cOp  However, 
the  information  gained  about  class  ux^  can  not  offset  the  loss  in 
information  about  class 

9  9 

When  N  =  nj^  +  n2,  n^^  -  n2  and  6''  are  fixed  (i.e.  nj^ ,  n2  and  6'^  are 
fixed),  as  p  increases,  and  a  decrease  but  3  increases.  Thus,  the 
effect  of  an  unequal  training  sample  size  is  more  pronounced  when  a 
large  number  of  features  is  employed.  This  conclusion  agrees  with  the 
observation  in  [24] . 

When  n^  is  greater  than  n2  the  above  results  still  hold,  except  ?^-^ 
and  Pg2  interchange  their  roles.  After  this  section  we  will  consider 
only  the  case  where  nj^  =  n2. 

7 .2    Minimum  Increase  in  6    to  Avoid  Peaking. 
For  n2  =   n^    the   expected   actual   error   rate   Pg   in   (7.9)    can  be 
simplified  to 


P„  =  *(-Y) 


(7.10) 


where 


Y  = 


1  r(N  -  2  -  p)(N  -  5  -  P)/^  6 

2  ^    (N  -  3)(N  -  3  -  p)     ^      . -2  ,  4£-l/2 


(7.11) 


N  =        +  n2  =  2n 
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From  (7.10)  the  expected  actual  error  rate  is  only  a  function  of  the 
total  number  of  training  samples  N,  Mahalanobis  distance  6^,  and  the 
number  of  features  p.     Note  from  (7.10)   that  the  expected  actual  error 

rate  decreases  for  a  given  p  as  either  or  both  N  and  6     increase  but 

2 

increases  for  a  fixed  N  and  6'^  as  p  increases.     Thus,   if  a  fixed  number 

N  of  training  samples  is  available,   it  is  interesting  to  know  how  much 
2 

6  should  be  increased  to  justify  an  additional  feature.  In  other 
words,  what   is   the  least  contribution  to  the  Mahalanobis  distance  that 

an  additional   feature  should  make  in  order  for  it  to  be  included  as  a 

2  2 
new  feature.      Mathematically,    we  want   to   find  x     such  that  Pgj(('5  + 

x^),   N,   p+1)  =  Pe2('^^»         p).     But  =  Pg2»  when  they  have  the  same 

value  of  Y  i.e. 


W(N-2-p)(N-5-p)!^2  6^  , 
2^  (N-3)(N-3-p)   ^  ij 


l.(N-2-p-l)(N-5-p-l). 
r  (N-3)(N-3-p-l)  ' 


111 


6    +  X 


(62^x2-^A(2±li)  1/2 
N 


or 

(N-2-p)(N-4-p)(N-5-p)  ^  ^    ^        N  ''^     ( 6^  +  x^)^  ^^^^2) 

(N-3-p)^(N-6-p)  5^  + 

N 

It  can  be  shown  that  a  sensible  solution  of  (7.12)  is 

C.N  +  /  CfN^  +  16(6'^N  +  4p)(p+l)C, 

x2=62(-^^  ^-  (7.13) 

2(6%  +  4p) 
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where 

C    _  (N  -  2  -  p)(N  -  4  -  p)(N  -  5  -  p)  ^^^^^^ 
^  (N  -  3  -  p)^(N  -  6  -  p) 

2  2 
The  plots  of  X    versus  p  for  different  values  of  6    when  N  =  20  and  100 

are   shown  in  Figures   7.2a  and  7.2b.      From  these   figures,    it   is  clear 

2  ? 
that  when  N  is  large,  x    is  small.     On  the  other  hand,  when  6    or  p  is 

2 

large,    x     is    large    too.      This   may   be   explained   by   noting   that  the 

estimates  are  more  reliable  when  more  training  samples  are  employed  or 

less  parameters  are  estimated.     Another  important  feature  of  Figure  7.2 

2 

is  that  the  curves  have  exponential  character  for  large  values  of  6  . 
This  can  be  explained  by  rewriting  (7.12)  as  follows. 

(H-2-P)  f,        1    V,  ,     1    ,  (.a,, 2)2 

(N-3-p)   ^'  *  N-6-pJ  s2+A£  5* 

N 

when  6^  » 

N 


62  +  ^  "  62 

and 

N-3-p  ^  ^        N-6-p  ■' 


Hence,  from  (7.15)  it  can  easily  be  shown  that 
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P 

Figure  7.2b      The  minimal  curves  for  N  =  100. 
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This  explains  the  exponential  character  and  the  properties  of 
Fig.  7.2.    All  of  the  results  in  this  section  agree  well  with  [24]. 


Given  N  training  samples  (n  for  each  class,  N  =  2n)  and  P 
measurements  for  each  sample,  a  basic  question  is  how  many  features  and 
which  features  should  be  employed  to  minimize  the  error  rate?  As  we 
noted  in  the  previous  section,  the  error  rate  depends  on  the  total 
number  of  training  samples  N,  the  number  of  features  p  and  Mahalanobis 
distance  6^.  In  this  case  the  number  of  training  samples  N  is  fixed, 
thus  only  the  number  of  features  p  and  the  Mahalanobis  distance  6^  are 
variables.  However,  the  Mahalanobis  distance  6^  depends  on  the  p 
features  selected.  Different  sets  of  p  features  may  have  different 
values  of  Mahalanobis  distance.  Recall  from  the  last  section  that  for  a 
fixed  number  of  features  p,  the  error  rate  decreases  as  the  Mahalanobis 
distance  increases.       Thus,    the    problem   reduces    to    a    problem  of 

finding   p   and   a   corresponding   set   of   p   features   with   the  highest 
among   the   set   of   all   p   features   which   minimize   Pj,   in   (7.10).  Also 
recall  that  P^  is  monotonically  decreasing  with  y,   thus  we  can  maximize 
Y  in  (7.11)  or  y  with  the  constant  terms  removed,  then  y  becomes 


7.3    The  Optimum  Number  of  Features 


(N-2-p)(N-5-p) 
(N-3-p) 


1/2 


62 


(7.17) 
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One  way   of   solving   this    is   for   each   p   (p=l,  ,    P)    select  the 

combination  of  p  features  which  will  yield  the  highest   6^  (i.e.  select 

the  best   set   of   p  features),   then  obtain         from  (7.17).     The  optimum 

number  of  features  p^^^  is  the  smallest  p  such  that  T    >  T  ^,   and  those 

opt  p  p+l 

Pgpj.  features  with  the  highest  6^  are  the  corresponding  features.  As 
was  shown  in  the  last  section,  this  is  equivalent  to  the  problem  of 
determining  when  the  contribution  of  an  additional  feature  to  the 
accumulated  Mahalanobis  distance  falls  below  a  threshold.  Note  that  an 
additional  feature  is  always  the  best  among  the  remaining  features  (i.e. 
the  feature  are  selected  from  the  best  to  worst).  We  now  determine  the 
optimal  number  of  features  for  two  special  cases.  These  examples  will 
clarify  the  points  discussed  earlier  in  this  section. 

Case  1 .  Each  feature  is  equally  good,  that  is,  the  Mahalanobis  distance 
for  any  set  of  p  features  is 


52  =    ^    ^2  ,  „^2 


P  i=l 


d^  =  pd^         p  =  1,  ,P 


where  d    is  the  Mahalanobis  distance  for  each  feature.     One  such  case  is 

when   the   difference   between   the   means    is   (jij^  -  ^^2)  ~  t^^j  d»  » 

and  the  common  covariance  matrix  is  1  =1. 

After  p  features  are  selected,   T  in  (7.17)  becomes 


J,    ^  |-(N-2-p)(N-5-p)^^^^  pd^ 
p  N-3-P  ^  (pd2-.A£)l/2 
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1  1  1/2  2 

=  (p(N-4-p))     ^^,ly2  ^'-''^ 

Note  that   the  last  term  in  (7.18)   is  independent  of  p  and  for  large  N 

the   first   bracket   is   approximately   equal   to   1   thus   independent   of  p. 

Therefore,    the  p  which  maximizes         is   the  p  which  maximizes  p(N-4-p) 

which  is  p^p^  =  f  -2- 

Note  that  the  result  in  this  example  is  not  quite  the  same  as  that 

N 

in    [24],    where   they   obtained  J"  1    (i"^  notation).      However,  the 

difference  is  small  (i.e.  equal  to  1,  to  be  exact).  As  noted  in  the 
last  chapter,  the  Lachenbruch  estimate  tends  to  be  slightly 
conservative. 


Case  2.  The  contribution  to  the  Mahalanobis  distance  of  each  feature  is 
a  fixed  fraction  of  the  contribution  of  the  previous  feature.  Thus 
after  p  features  are  chosen,  the  accumulated  Mahalanobis  distance  is 


62  =  d2  +  c2d2  +  c'^d2  +  +  c^^P  ^h^,  c>0, 

P 


2p 

=  d' 


=  a2  1-c 


-^2 


1-C 


where  d2  is   the  Mahalanobis  distance  of  the  first  feature.     An  example 
of     this      case     is     when     the     difference     between     the     means  is 
(vij^  -  y  )  =  [d,  cd,  ,  c^  ^d]^     ^^'^  common   covariance  matrix  is 
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After  p  features  are  selected,   T  in  (7.17)  becomes 

r    =  ((N-2-p)(N-5-p)^^/^   i^ci  

p  (N-3-P)  ,,2  .  AE^ 

The  plots  of  r  versus  p  when  N=30  and  d=l  for  various  values  of  c  are 
shown  in  Fig.  7.3.  It  can  be  seen  that  increasing  c  leads  to  a  better 
classification  and  a  larger  optimal  number  of  features.  One  important 
characteristic  of  Fig.  7.3  is  that  when  c  is  small,  the  curves  reach  a 
peak,  at  a  very  small  value  of  p.  But  when  c  >  1,  i.e.,  the  successive 
features  are  rated  better  or  contribute  more  significantly  to  correct 
classification,  causing  the  peak  to  occur  at  a  very  large  value  of  p. 
This  might  lead  one  to  think  that  the  peaking  phenomena  can  be  avoided 
by  selecting  features  from  worst  to  best.  Fig.  7.5  shows  the  plots  of  T 
versus  p  when  the  features  are  selected  from  best  to  worst  as  well  as 
from  worst  to  best.  It  is  clear  that  even  though  the  peak  of  the  worst 
to  best  ordering  occurs  later,  the  performance  of  best  to  worst  ordering 
is  always  better  (i.e.,  higher  T) ,  except  when  all  the  features  are 
employed;   in  which  case  the  performance  is  the  same  for  both  cases. 

Figure  7.4  shows  the  plots  of  V  versus  p  where  d^  =  1.0  and 
c  =  0.90  for  various  values  of  N.  Note  that  increasing  N  improves  the 
performance  and  increases  the  value  of  the  optimal  number  of  features 
required  for  classification.  Table  7.1  shows  the  optimal  number  of 
features  for  various  values  of  c,  N  and  d^.  Note  that  for  a  small  value 
of  c,  increasing  N  does  not  significantly  increase  the  optimal  number  of 
features,  but  when  N  is  large  changing  d^  does  change  the  optimal  number 
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8  5  le  15  20  25 


P 


Figure  7.3 


Effect  of  changing  c. 
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Figure  7.5      Comparison  of  feature  orderings. 


Table  7.1      The  optimal  number  of  features  for  Case 
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Mahalanobis  distance 


c 

N 

0.  04 

0.50 

1.0 

10.0 

25.  0 

20 

2 

3 

3 

4 

4 

50 

3 

5 

5 

6 

6 

0.80 

100 

4 

6 

7 

3 

8 

150 

4 

7 

3 

9 

9 

200 

4 

7 

8 

10 

10 

20 

3 

4 

4 

5 

5 

50 

4 

6 

7 

8 

8 

0.85 

100 

5 

8 

9 

10 

10 

150 

5 

9 

10 

11 

12 

200 

6 

10 

11 

12 

13 

20 

4 

5 

5 

5 

5 

50 

6 

8 

9 

1 0 

10 

0.  90 

100 

7 

11 

12 

14 

14 

150 

8 

13 

14 

16 

16 

200 

8 

14 

15 

17 

17 

20 

5 

6 

6 

6 

6 

50 

10 

13 

13 

14 

14 

0.95 

100 

13 

18 

20 

21 

21 

150 

15 

22 

23 

25 

25 

200 

16 

25 

26 

28 

28 

20 

7 

7 

7 

8 

8 

50 

19 

20 

20 

20 

20 

0 . 99 

100 

33 

37 

38 

38 

38 

150 

44 

51 

52 

52 

53 

200 

53 

62 

63 

64 

64 

20 

g 

9 

g 

o 

o 

50 

23 

23 

23 

^  w 

1.00 

100 

TO 

~  o 

TO 

150 

73 

73 

73 

73 

73 

200 

98 

98 

98 

98 

98 

?  n 

o 
o 

o 

O 

o 

Q 

=in 

^  u 

CO 

<io 

Co 

1 .  01 

100 

fit 

O  X 

OX 

150 

106 

103 

102 

102 

102 

200 

150 

148 

148 

148 

148 

20 

12 

11 

11 

11 

11 

50 

40 

40 

40 

40 

40 

1.10 

100 

90 

90 

90 

90 

90 

150 

140 

140 

140 

140 

140 

200 

190 

190 

190 

190 

190 
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of  features.  When  c  is  large  ( «1  and  larger)  the  optimal  number  of 
features  does  not  depend  on  d^,  but  depends  on  the  number  of  training 
samples  N.    This  result  agrees  well  with  [24]. 

7 .4    The  Optimal  Number  of  Features 
for  Several  Covariance  Matrix  Structures 

In  the  previous  section,  we  considered  two  types  of  functions 
relating  the  Mahalanobis  distance  6^  and  the  number  of  features  p 
regardless  of  the  structure  of  the  covariance  matrix  and  the  mean 
vectors.  In  this  section  we  consider  three  types  of  Toeplitz 
matrices.  For  simplicity,  let  us  assume  that  the  difference  between  the 
means  for  each  feature  is  the  same  (and  they  also  have  the  same  sign), 

i.e.  ^2  ~  -Hi  ^         ^>   '  '^^"^»  Mahalanobis    distance  is 

the  sum  of  all  elements  of  I  times  d^.     It  will  be  shown  that  even 

in  this  simple  case,  it  is  difficult  or  impossible  to  order  the  features 
in  the  optimum  way  (i.e.  from  best  to  worst). 

Example  1 .  Equal  correlation  covariance  matrix.  This  form  of 
covariance  matrix  is  given  by 


I  = 


1 
P 


P 
1 
P 


P 
P 
1 


P 
1 


,  where  -r  <  p  <  1 

p-1 
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It  can  be  shown  that  the  inverse  of  this  matrix  is  given  by 


r'  - 

where 

 1 

~  (1-p) 

Thus  we  obtain 

62=   2^   (7.20) 

%      p(p-l)  +  1 

Note  that  in  this  case  all  of  the  features  are  equally  good.  Any  set  of 

p  features  will  give  the  same  Mahalanobis  distance  6^  which  is  given  by 

(7.20).      However,    for   different   values   of   p,    each   additional  feature 

contributes  to  the  Mahalanobis  distance  a  different  amount.  This  can  be 

seen  from  6^  ,  -  6^  which  is  given  by 
p+1  p 

62      _  62  =   ^^^^  "    (7.21) 

p+1        p      (pp+l)(pp+l  -  p) 


X  y 

y  X 


y 
y 


+  (p-2)p 
(1  +  p(p-l)) 


X  y 
y  X 


-P 


(1  +  p(p-l)) 
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For  p  >  0,   (7.21)   shows  that   6^      -52  is  monotonlcally  decreasing  as  p 

p+1  p 

increases.         This     implies     that     the     increase     in     the  accumulated 

Mahalanobis   distance   becomes   less  with  each  additional  feature.  Thus, 

in   this    case,    we    can   select   the   features   arbitrarily   but   still,  in 

effect,  have  the  features  in  the  best-to-worst  ordering.     When  p  <  0  the 

situation  becomes  reversed.     The  increase  in  the  accumulated  Mahalanobis 

distance  becomes  greater  with  each  additional  feature.    Therefore,  it  is 

impossible  to  order  the  features  in  a  best-to-worst  ordering.     This  also 

implies  that  for  p  <  0,  peaking  is  not  really  a  problem  since  it  occurs 

at  p  greater  than  y  -  2.     Thus,  we  will  determine  the  optimal  number  of 

features  for  the  case  of  0  <  p  <  1  only. 

For  0  <  p  <  1 ,        decreases  as  p  increases  for  a  fixed  p.     So  we 
P 

would  expect  peaking  to  occur  at  a  smaller  p  when  p  is  larger.     When  p 

is   small,    6^  is   almost  proportional   to   p,    i.e.    6^      _  52  „  constant, 
p  P+1  P 

N 

thus  peaking  should  occur  at  p  "  —  -  2.  Then  for  a  larger  p,  the 
peaking  should  occur  at  p  <  |-  -  2.  The  optimal  number  of  features  for 
various  values  of  p,  d^  and  N  are  shown  in  Table  7.2.  This  table  shows 
that  when  p  is  large,  an  additional  feature  is  almost  useless.  When  p 
is  small,  the  optimal  number  of  features  is  almost  independent  of  d''. 
For  large  d,  T  in  (7.17)  can  be  rewritten  similarly  to  (7.18)  thus 
reducing  the  problem  to  a  problem  of  maximizing  the  expression 


which  gives 
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Table  7.2    The  optimal  number  of  features  for  Example  1. 


P 


d 

N 

0.  001 

0.  01 

0.1 

0.3 

0.5 

0.8 

20 

8 

7 

5 

2 

1 

1 

100 

47 

37 

13 

4 

2 

1 

0.1 

200 

93 

66 

18 

6 

3 

1 

500 

221 

128 

29 

10 

5 

2 

1000 

408 

203 

41 

13 

7 

3 

20 

8 

7 

5 

3 

2 

1 

100 

47 

39 

19 

9 

5 

2 

1.0 

200 

93 

71 

30 

13 

3 

3 

500 

223 

142 

50 

22 

13 

5 

1000 

412 

227 

74 

31 

18 

8 

20 

8 

8 

6 

4 

3 

2 

100 

47 

40 

21 

12 

8 

4 

10.0 

200 

94 

72 

33 

18 

12 

6 

500 

223 

143 

57 

30 

20 

10 

1000 

413 

230 

85 

44 

28 

14 
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(p-1)  +  /(p-1)^  +  p(N-4)(l-p) 


opt 


Example  2.  The  covariance  matrix  in  this  example  has  the  following 
form. 


P 

1 


p2  p 


,P-1 


P 
1 


p-1 
•       •  p*^ 


1 

P  1 


(7.22) 


Thus  the  correlation  between  features  x^  and  x^  is  equal  to  p'  '. 
For  simplicity  we  will  consider  only  the  case  where  0  <  p  <  1.  Note 
that  in  this  case,  when  each  feature  is  used  alone,  all  features  are 
equally  good.  However,  different  sets  of  p  features  may  not  give  the 
same  value  for  the  Mahalanobis  distance.  For  example,  the  Mahalanobis 
distance  of  any  two  features  x^  and  x^  is  monotonically  increasing  with 
I i-j I ,  since  the  correlation  between  two  features  is  lower  if  they  are 
farther  apart.  Thus,  the  best  combination  of  two  features  is  the  first 
and  the  last  one,  and  the  worst  combination  is  any  two  consecutive 
features. 

Jain  and  Waller  [24]   and  El-Sheikh  and  Wacker  [25]   have  considered 

this  problem.     They  find  the  optimal  number  of  features  p        to  be  the 

opt 

order  of  the  covariance  matrix,  p,  in  (7.22)  that  minimizes  the  expected 


100 


probability    of    error.       In    effect,    they    select    the    features    in  a 

"natural"  order.     It  is  shown  in  [25]    that  an  additional  feature  always 

increases  the  accumulated  Mahalanobis  distance  by  6^      _  ^2  _  i_E  j2  fQj- 

p+1        p  1+p 

N 

p  >  1 .  Thus  the  Pgpj.  approximately  (but  always  less  than)  ^  ~  ^ 
(since  the  first  feature  contributes  d^  >  ~^  d^).  However,  from  the 
above  discussion,  it  is  clear  that  values  given  by  [24,25]  are  higher 
than  the  true  ones.  In  this  case,  Pgpj.  also  depends  on  the  total  number 
of  features  available,  P,  in  addition  to  d^,  p,  and  N.  The  larger  value 
of  P,  the  better  the  performance  and  the  larger  p^^^.  will  be  for  a  fixed 
p,  d^  and  N.  One  of  the  best  ways  to  select  features  in  this  case  is 
the  first  one  first,  the  last  one  second,  the  middle  one  third,  etc., 
choosing  each  successive  measurement  (feature)  to  be  "maximally"  distant 
from  all  previously  selected  measurements.  The  plots  of  T  versus  p  for 
the  "optimal"  and  "natural"  orderings  when  d^  =  1,  p  =  0.5,  N  =  20  and  P 
=  13  are  shown  in  Fig.  7.6.  We  can  see  that  the  "optimal"  value  always 
gives  a  better  performance  except,  of  course,  when  p=l  or  13,  i.e.,  one 
or  all  features  are  used.  The  effect  of  changing  P  for  a  given  d^,  p, 
and  N  is  shown  in  Fig.  7.7.  The  optimum  number  of  features  for  various 
values  of  N,  P,   p,  and  d^  is  shown  in  Table  7.3. 


Example  3.  First  order  Makov  covariance  matrix.  The  form  of  the 
covariance  matrix  is  given  by 
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N  =  20 
d  =  1.0 
-P  =  13 

1  1  1  1  1  1  

p  =  0-5 

Optinal  ord*rin9 

Natural  ordering 

 I      ■  1 

1  •  1  ■   1 

1  •_  ■  1 

• 

e  3  6  9  12  15 

P 


Figure  7.6 


Comparison  of  feature  orderlngs  for  Example  2. 
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Table  7.3    The  optimal  number  of  features  for  Example  2. 


p 

d 

N 

0.1 

P 
0.5 

0.9 

o  n 

O 

■» 
o 

t\  in 
u .  1  u 

o 
o 

80 

30 

16 

3 

20 

6 

4 

2 

N/2 

1.00 

40 

15 

9 

3 

80 

33 

17 

5 

20 

6 

4 

2 

5.00 

■40 

15 

9 

3 

80 

33 

17 

5 

20 

5 

2 

1 

0.10 

40 

10 

4 

2 

80 

20 

9 

2 

20 

5 

3 

2 

N/4 

1.00 

40 

10 

5 

2 

80 

20 

12 

3 

20 

5 

3 

2 

5.  00 

40 

10 

6 

2 

80 

20 

12 

3 
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where     P  ^  "J              large  p.        For    this    type    of    covariance  matrix, 

correlation     exists     only  between     adjacent     features.         Again,  the 

discussion    in    Example    2  applies    here.       Jain    and    Waller    [24]  and 

El-Sheikh  and  Wacker   [25]  have   also  addressed   this   problem.     Jain  and 

Waller    [24]     show    that  -  6^  «  d^/Cl+Zp).      Thus,     p         is  again 

p+I        p  opt 

approximately  equal  to  y  -  2. 

Note  that  in  this  case  all  the  odd  features  are  uncorrelated  with 
each  other.  All  the  even  features  are  also  uncorrelated  with  each 
other.  Thus  it  seems  reasonable  that  all  the  odd  features  should  be 
chosen  first  since  the  number  of  the  odd  features  is  always  greater  than 
or  equal  to  the  total  number  of  even  features.  In  this  case,  p^p^.  also 
depends  on  the  total  number  of  features  available,  P.  The  comparison  of 
the  two  ordering  methods  is  shown  in  Fig.  7.8  for  N  =  20 ,  d''  =  1,  p  = 
0.3  and  P  =  13.  Fig.  7.9  shows  the  effect  of  changing  P  for  a  fixed  N, 
d  ^ ,  and  p . 

N 

When  P  >  N  -  4,  there  are  at  least  J  ~  2  uncorrelated  features.  In 
either  the  even  or  odd  feature  set,  each  feature  contributes  equally  to 
the  accumulated  Mahalanobis  distance  and  is  equal  to  d^.  Thus  in  this 
case  p^p^  =  |.  _  2.  Observe  that  when  either  all  the  odd  or  even 
features  are  chosen,  for  large  p,  an  additional  feature  from  another  set 
contributes     only     a     small     amount     to     the     accumulated  Mahalanobis 
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Figure  7.8      Comparison  of  feature  orderings  for  Example  3. 
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Figure  7.9      Effect  of  changing  P  for  Example  3. 
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distance.  Thus  for  large  p  and  P  <  N-4 ,  P^p^  is  approximately  (N+l)/2 
when  N  is  odd  and  N/2  when  N  is  even.  The  optimum  number  of  features 
for  various  values  of  d^,  N,   p,  and  P  are  shown  in  Table  7.4. 

7 .5  Remarks 

1.  It  must  be  emphasized  that  the  word  "feature"  here  is 
synonymous  with  "measurement"  or  "observation."  It  is  not  a 
weighted  combination  of  measurements.  This  implies  that  we 
must  use  feature  selection  to  reduce  the  number  of  features 
(measurements)  for  avoiding  the  dimensionality  problem.  The 
feature  extraction  algorithm  is  used  to  reduce  the  number  of 
featues,  simplifying  the  problem. 

2.  The  discussion  in  this  section  .is  valid  only  when  the 
difference  between  the  means  for  the  two  classes  is  equal  in 
both  sign  and  magnitude  in  all  directions  and  the  correlation 
coefficients  (p's)  are  positive.  If  any -of  these  conditions  is 
violated,  the  discussion  above  is  invalid  and  the  situation 
becomes  more  complicated.  For  example,  in  Example  1,  if  the 
difference  between  the  means  for  each  feature  is  equal  in 
magnitude  but  of  different  sign,  then  all  features  are  not 
equally  good.  As  another  example,  if  the  correlation 
coefficients  (p's)  are  negative,  the  feature  with  the  largest 
magnitude  of   | p|  may  be  the  best  feature. 

3.  From  the  examples  above  it  is  clear  that  in  general  there  is  no 
systematic  way  of  selecting  features.     The  best  way  is  to  try 
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Table  7.4      The  optimal  number  of  features  for  Example  3. 


p 

2 

d 

H 

0 .  06 

-  / 

P 

0 . 1 

0 . 3 

20 

6 

5 

5 

0 . 10 

40 

16 

13 

10 

80 

34 

30 

20 

7 

6 

5 

N/2 

1 .  00 

40 

1  s 

1  A 

80 
ow 

35 

WW 

20 

7 

7 

5 

5.00 

40 

17 

15 

10 

80 

35 

33 

20 

20 

5 

5 

3 

0.10 

40 

10 

10 

5- 

80 

20 

20 

10 

20 

5 

5 

3 

N/4 

1.00 

40 

10 

10 

5 

30 

20 

20 

20 

20 

5 

5 

3 

5.00 

40 

10 

10 

10 

80 

20 

20 

20 
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all    the    combinations.       However,  the    amount    of  calculation 

required    is    enormous    even    for   a  modest   number   of  features. 

Note  that  the  classification  rule  (2.30)  of  Chapter  2  can  be 
written  as 

X  e  oj^  if  D  =  x*^         (Jil  -  Ji2^  ~  2  ^-1  "  -2^^^  I  ^  (Jii  "         >  °* 
X  e  cj^  otherwise. 

where  x  is  the  measurement  vector  and 

x^  =  (x^,  x^,  ,  Xp).         This    rule    can    be    interpreted  as 

follows.  If  the  weighted  sum  of  measurements  x^  is  greater 
than  a  threshold,  we  decide  x  e  (d^  otherwise  x  e  1^2'  Thus  a 
significant  measurement  should  have  a  large  weight.  Then  we 
can  order  the  measurements  by  the  value  of  the  weighting 
vector,  (jij^  ~Ji2^*  ^    measurement    with     the  highest 

weighting  factor  should  then  be  selected  first. 


7 .6  Conclusion 

The  peaking  phenomenon  for  equiprobable  multivariate  Gaussian  data 
with  equal  covariance  matrix  has  been  investigated.  We  have  shown  that 
peaking  can  be  avoided  if  an  additional  feature  increases  the  value  of  T 
in  (7.17).  This  condition  has  been  shown  to  be  equivalent  to 
determining   when   the    increase    in   the   Mahalanobis   distance   is  smaller 
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than  a  threshold  given  by  (7.13).     The  condition  for  the  optimal  number 

of  features  for  three  types  of  covariance  matrices  has  been  studied.  We 

have   emphasized   the   importance   of   feature   ordering  an  suggested  that 

features  can  be  selected  in  the  same  order  as  the  order  of  the  weighting 
r-1 

factor   value,    i    (jj^^  ~  ii2)*  ^   fixed    total   number    of  training 

samples,  we  suggest  using  an  equal  training  sample  size  for  each  class. 


CHAPTER  8 
PATTERN  RECOGNITION  OF  EEC 

The  application  of  pattern  recognition  to  ERP  data  is  discussed  in 
this  chapter.  The  goal  of  our  experiments  is  to  assess  the  feasibility 
of  visual  evoked  response  to  determine  the  objective  validity  of  texual 
materials  as  read  by  human  subjects.  The  experiment  described  here  is 
one  of  many  experiments  conducted  at  the  Mind-Machine  Interaction 
Research  Center,  Department  of  Electrical  Engineering,  University  of 
Florida  which  have  been  reported  in  several  papers  and  reports  [12,96- 
102] .  We  briefly  describe  the  experimental  design  and  data  collection 
procedure  in  Sections  8.1  and  8.2  respectively.  Then  in  Section  8.3  we 
discuss  the  data  analysis  technique  and  compare  the  various  methods. 
The  results  are  discussed  in  Section  8.4. 

8.1  Experimental  Design 
The  rationale  for  this  experimental  design  was  given  by  Childers  et 
al.  [96].  In  this  experiment,  we  recorded  cortical  potentials  while 
subjects  decided  if  the  statements  presented  were  true  or  false.  The 
subjects  learned  the  true  value  of  these  statements  during  a 
familiarization  phase  preceding  data  collection.  The  purpose  of  the 
experiment  was  to  determine  if  the  pattern  of  difference  between  ERPs 
for  true  and  false  statements  could  be  observed  for  information  acquired 
recently  for  both  responding  and  non-responding  subjects. 

Ill 
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One  day  prior  to  collections  of  ERPs,  the  subjects  learned  a  set  of 
18  statements  (statements  number  1-9  and  19-27  in  Table  8.1)  about 
fictitious  people  and  their  occupations,  e.g.,  "Scott  is  a  lawyer".  The 
next  day,  they  were  shown  statements  consisting  of  correct  or  "true" 
pairings  of  names  and  occupations,  i.e.  those  18  statements  learned  in 
the  previous  day  or  with  "false"  or  mismatched  names  and  occupations 
(e.g.,  "Scott  is  a  singer"),  i.e.  statement  numbers  10-18  and  28-36, 
while  the  EEG  was  recorded.  Statements  were  randomly  presented.  In 
each  session,  each  statement  was  presented  twice.  There  were  a  total  of 
72  trials;  36  trials  corresponded  to  true  statements  and  36  trials  to 
false  statements.  Four  sessions  of  data  were  collected  for  each 
subject.  The  subjects  were  asked  to  respond  truthfully  in  two  sessions 
and  not  to  respond  in  the  other  two  sessions. 

8.2  Data  Collection  Procedure 
We  monitored  and  digitized  data  from  five  scalp  electrode 
locations:  Fz,  C3,  C4,  Cz  and  Pz  in  the  10/20  EEG  system,  from  one 
timing  channel,  and  from  an  EGG  channel.  Each  EEG  location  was  a 
monopolar  derivation  to  the  linked  mastoids.  The  data  were  digitized  at 
the  rate  of  125  samples/sec/channel  (or  8  msec  sampling  interval).  The 
amplifiers  had  a  bandwidth  from  1  to  50  Hz,  with  50%  attenuation  at  1 
and  50  Hz  [103].  Presentation  of  stimuli  and  collection  of  data  was 
controlled  by  a  NOVA  4  computer.  Details  of  the  software  can  be  found 
in  [104]  . 
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TABLE  8.1    Statements  Employed  in  the  Experiment 


1. 

2. 

3. 

4. 

5. 

6. 

7. 

8. 

9. 

10. 

11. 

12. 

13. 

14. 

15. 

16. 

17. 

18. 


SCOTT 
PAUL 
JIM 
DAN 

MARTIN 

STEVEN 

ROBERT 

WALTER 

LARRY 

SCOTT 

PAUL 

JIM 

DAN 

MARTIN 
STEVEN 
ROBERT 
WALTER 
LARRY 


IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 


LAWYER. 

DOCTOR. 

STUDENT. 

BROKER. 

CLERK. 

SINGER. 

FARMER. 

MECHANIC. 

SCIENTIST. 

SINGER. 

CHEMIST. 

FARMER. 

DENTIST. 

JUDGE. 

DOCTOR. 

DANCER. 

SENATOR. 

MECHANIC. 


19. 
20. 
21. 
22. 
23. 
24. 
25. 
26. 
27. 
28. 
29. 
30. 
31. 
32. 
33. 
34. 
35. 
36. 


GWEN 

FLO 

JOAN 

SUE 

DIANE 

BARBARA 

MARSHA 

ELLEN 

JUDY 

GWEN 

FLO 

JOAN 

SUE 

DIANE 

BARBARA 

MARSHA 

ELLEN 

JUDY 


IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 
IS  A 


WRITER. 

DENTIST. 

MANAGER. 

TEACHER. 

CHEMIST. 

SENATOR. 

JUDGE . 

BANKER. 

DANCER. 

BANKER. 

STUDENT. 

SCIENTIST. 

LAWYER. 

BROKER. 

MANAGER. 

CLERK. 

TEACHER. 

WRITER. 
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Prior  to  the  experiment,  the  subject  was  given  the  appropriate 
instructions  relevant  to  the  experiment  while  the  electrodes  were  being 
placed  on  the  scalp.  A  50  microvolt,  10  Hz  calibrate  signal  was 
connected  to  the  input  of  all  the  EEG  amplifiers,  prior  to  the  subject 
being  connected  to  the  amplifier,  to  calibrate  all  recording  channels. 
The  subject  was  then  seated  in  the  ventilated  Faraday  shielded  screen 
room  and  the  electrodes  connected  to  the  amplifiers. 

The  first  ER  trial  was  a  photic  evoked  response  using  a  Grass 
Instrument  xenon  photo  stimulator.  The  subject  was  then  presented  the 
sentences. 

The  sentences  were  presented  via  an  HP  2468A  graphic  terminal  which 
was  connected  to  the  TV  monitor  viewed  by  the  subject.  The  TV  monitor 
was  located  approximately  32"  from  the  subject's  eyes.  The  contrast  and 
brightness  were  adjusted  to  each  subject's  specification  for  viewing 
comfort.  The  sentence  was  presented  in  three  segments  to  minimize  eye 
movement,  e.g., 

SCOTT 
IS  A 
LAWYER 

The  sequencing  of  the  presentation  of  the  sentence  segments,  with 
respect  to  data  acquisition,  is  shown  in  Table  8.2.  The  total  data 
digitized  for  each  sentence  was  4.1  sec  or  512  samples  for  each  channel. 

The  subject  was  instructed  to  respond  (response  session)  by 
actuating  a  switch  within  1  sec  after  the  last  sentence  segment  appears. 
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TABLE  8.2    Timing  Diagram 


MSEC  EXAMPLE  EVENT 


Fixation  Box  Appears 

0  Begin  Data  Collection 

400  SCOTT                      Subject  Presented 

800  Subject  Erased 

1200  IS  A                      Verb  Presented 

1600  Verb  Erased 

2000  LAWYER                     Object  Presented 

2400  Object  Erased 

2000-4048  Response  Executed 

4048  End  Data  Collection 

4048  Message  (If  Any)  Presented 

4448  Message  Erased 

4448-7500  Write  Data  To  Disk  And  Start 

Next  Presentation 
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The  timing  of  the  segments  and  the  subject's  response  were 
collected  as  one  digitized  data  channel.  Fz,  Cz,  timing  and  EOG  were 
also  monitored,  via  a  Grass  Instruments  Polygraph,  on-line  during  the 
experiments. 

Eight  subjects  have  participated  in  this  experiment.  Each  subject 
attended  four  sessions  of  72  sentences  and  responded  truthfully  in  two 
sessions,  and  made  no  response  in  the  other  two  sessions.  Thus  we  have 
two  sets  of  data  for  each  subject,  one  is  the  "response"  data  and  the 
other  is  "no  response"  data.  Each  set  consists  of  two  classes  of  data; 
data  corresponding  to  the  true  sentences  and  data  corresponding  to  false 
sentences.  For  the  response  data  set  these  two  classes  are  called  true 
response  (TR)  and  false  response  (FR).  Similarly  for  the  no-response 
data,  one  class  is  called  true  no-response  (TN)  and  the  other  is  called 
false  no-response  (FN). 

8.3  Data  Analysis  Techniques 
Each  data  set  described  in  the  last  section  was  analyzed 
separately.  The  data  in  each  set  was  partitioned  into  two  subsets,  one 
was  used  to  design  a  classifier,  the  other  was  used  for  testing,  i.e. 
the  holdout  estimate  described  in  Section  6.1  was  used  for  estimating 
the  error  rate.  Since  each  data  set  contained  two  sessions  of  72 
sentences,  we  partitioned  the  data  by  using  one  session  for  design  and 
the  other  session  for  testing.  Thus  we  had  36  sentences  for  each  class 
(i.e.  36  training  samples  for  each  class)  for  designing  and  testing. 
This  number  is  generally  considered  to  be  small  especially  when  compared 
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with  the  number  of  features,  512  for  each  training  sample.  However,  we 
were  only  interested  in  discriminating  the  true  ERPs  from  the  false 
ERPs.  Thus,  we  need  only  look,  at  the  data  after  the  last  stimulus  and 
before  the  response  is  made  (for  response  session),  see  Figure  8.1. 
Consequently,  the  number  of  features  could  potentially  be  reduced  to 
less  than  262,  which  is  still  very  large.  Note  that  the  length  of  data 
(i.e.  the  number  of  features)  for  each  response  session  is  not  the  same 
for  all  sessions,  since  each  subject  takes  a  differing  amount  of  time  to 
answer  each  statement.  Thus  we  need  to  align  the  data.  In  this  study, 
three  data  alignment  techniques  are  considered. 

8.3.1    Data  Alignment  Techniques 

Three  data  alignment  techniques,  stimulus  locked  (S-locked), 
response  locked  (R-locked)  and  linear  time  warping  are  considered.  Each 
technique  is  based  on  different  assumptions.  The  segment  of  the  ERP 
between  the  last  stimulus  and  the  response  can  be  divided  into  three 
portions  (see  Figure  8.2).  The  first  portion  is  of  duration  x^^  which 
represents  the  time  delay  between  the  presentation  of  the  last  image  and 
the  initiation  of  the  subject's  thinking  process.  The  second  segment, 
T2  is  the  time  required  to  make  a  decision.  The  last  segment,  t^,  is 
the  time  required  to  make  a  response.  Ideally,  we  would  like  to  use  the 
data  from  the  second  segment  only.  However,  we  do  not  know  when  this 
segment  occurs.  Hopefully,  the  data  alignment  techniques  properly  align 
the  data,  so  that  the  second  segment  can  be  identified. 


Figure  8.1      The  segment  of  data  use  for  analysis  (a)  for  response 
session  (b)  for  non-response  session. 
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Figure  8.2      The  components  of  ERPs. 
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(1)  Stimulus  locked  (S-locked).  Under  this  method  the  data  are 
locked  to  the  last  stimulus  presentation  (2000  msec).  This 
technique  is  based  on  the  assumption  that  (the  time  delay  between 
the  last  stimulus  presentation  and  the  start  of  the  thinking 
process)  is  almost  constant.  Then,  when  the  data  are  locked  to  the 
stimulus,  the  second  portion,  (the  thinking  and  the  decision 
making  process)  will  align. 

(2)  Response  locked  (R-locked).  ■  Under  this  method  the  data  are 
locked  to  the  response  point  (arbitrarily  plotted  as  2800  msec). 
This  technique  is  based  on  the  assumption  that  (the  time  delay 
required  to  make  a  response)  is  almost  constant. 

(3)  Time  warping  (T-locked).  A  portion  of  the  signal  from  the  last 
stimulus  presentation  (object  of  the  sentence)  to  the  response  point 
is  either  linearly  stretched  or  compressed  to  make  all  trials  of 
equal  length  (120  samples  or  960  msec).  Thus,  for  example,  if  the 
last  stimulus  for  a  particular  trial  occurs  at  2000  msec,  the  data 
are  either  linearly  compressed  or  stretched  so  that  the  response 
becomes  located  at  2960  msec.  This  technique  is  based  on  the 
assumption  that  the  ratio  of  '^■^^''^2''^3  constant,  i.e.  when  the 
subject  responded  late  to  any  trial,  this  means  that  for  that 
particular  trial  he  was  slow  in  thinking  about  his  answer,  slow  in 
making  a  decision,  and  also  slow  in  making  the  response. 

The  plots  of  the  average  ERPs  over  all  8  subjects  when  the  data 
are  synchronized  by  S-locked,  R-locked  and  linear  time  warping  for 
channels  4,  6  and  timing  for  response  sessions  are  shown  in  Figures 
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8.3,  8.4  and  8.5  respectively.  The  data  for  S-locked,  channels  4,  6 
and  timing  for  no-response  sessions  are  shown  in  Figure  8.6. 

8.3.2    Features  Selection  and  Extraction  of  ERPs 

At  this  point,  we  have  some  one  hundred  features  or  so  for  each 
training  sample.  The  discussion  in  the  last  subsection  indicates  that 
the  segment  of  data  we  really  want  is  the  second  portion,  which  lies 
somewhere  between  the  last  stimulus  and  the  response.  Thus,  we  must 
take  a  segment  of  data  from  that  region  for  analysis.  The  averages  in 
Figures  8.3-8.6  can  help  us  locate  this  position  of  the  data.  From 
these  figures,  we  see  that  a  large  difference  between  the  two  classes 
occurs  in  the  region  2200-2400  msec  for  S-locked  data,  2500-2700  for 
R-locked  data  and  2400-2600  msec  for  the  time  warped  data.  This  gives 
25  features  for  each  trial  (training  sample). 

This   number   of    features    can   be   reduced   to   the  optimal  number  by 

using  a  feature  selection  technique.     Since  only  36  training  samples  are 

available  for  each  class   (with  25   features),    it   is  not  feasible  to  use 

the    nonparametric    approach    to    estimate    the    conditional  probability 

densities.    The  parametric  approach  must  be  employed  and  the  form  of  the 

conditional  probability  density  must  be  assumed.     In  this  case  we  will 

assume    that    the    data    are    normally   distributed   with   equal  covariance 

matrices  for  both  classes.    The  feature  selection  procedure  described  in 

Section  7.4  of  Chapter  7  can  be  used,   i.e.   the  features  are  ordered  by 
-1 

the    value    of     1  ~  —2^  where    the    covariance    matrix  1  and  means 

^,  and  Ji„  are  approximated  by  the  sample  covariance  matrix  S  ((3.10)  of 


123 


-le  - 


-IE 


e  laee  zeee  seee 

EVOKED  RESPONSE,  tUtJsMB,  SERIES:?,  BLOCK :R,  CHM:6     Cz ,  S-UOCK ,  TR  ws  FR 


MSEC 


(b) 


124 


ise 


lee  - 


58  - 


O 

o 

I— <  -58 


-188 


-158 


 1  1  1  !  1  !  1  1               1  i 

 1  •■  1  !  \  !  r—  f-  ,  r— 

ii    u  1 

'  I  i  .  I  1 

\ 

1       '       '  ' —  — 

1  

L_ 

;  i 

■ — ■ — 1  1  1 

1   •   ■   -   1 

1  ^  ■ 

e  leee  zeaa  3889 

EVOKED  RESPONSE,  SUBJiSOS,  SERIES:?,  BLOCKjR,  CHH:8  TIH,  S-LOCK,  TR  vs  FR 

(c) 


4868 

MSEC 


Figure  8.3      Average  of  stimulus-locked  data.     The  solid  line  is  the 
average  of  true  response  (TR) ,  while  the  dotted  line  is 
the  average  of  false  response  (FR)  for  (a)  channel  4; 
(b)  channel  6;     (c)  timing. 
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Figure  8.4      Average  of  response-locked  data.     The  data  have  all  been 
shifted  arbitrarily  so  that  all  responses  are  aligned  at 
2800  ms.     The  solid  line  is  the  average  of  true  response  (TR) 
while  the  dotted  line  is  the  average  of  false  response  (FR) 
for  (a)  channel  4;     (b)  channel  6;     (c)  timing. 
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Figure  8.5      Average  of  time  warping  data.    The  data  have  all  been 

either  stretched  or  compressed  so  that  all  responses  are 
aligned  at  2960  ms.     The  solid  line  is  the  average  of  true 
responses  (TR) ,  while  the  dotted  line  is  the  average  of 
false  responses  (FR)  for  (a)  channel  4;     (b)  channel  6; 
(c)  timing. 
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Figure  8.6      Average  of  stimulus-locked  data  for  no  response  sessions. 

The  solid  line  is  the  average  of  TN,  while  the  dotted 

line  is  the  average  of  FN  for  (a)  channel  4;     (b)  channel  6; 

(c)  timing. 
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Chapter  3)  and  sample  means  m^  ((3.8)  of  Chapter  3).  The  optimal  number 
features  and  the  corresponding  features  are  determined  by  (7.17)  of 
Chapter  7,  i.e.,  finding  a  p  such  that 


„    ^  .(N-2-p)(N-5-p),^^^  6^ 
P  '  ^     (N-3-P)  ^  1/2 


is  maximized,  where 


N  =  72 


2  2  t  -1 

6    is  approximated  by  D    =  (m^^  -  m^)  S     (m^^  -  m^). 

The  optimal  number  of  features  for  this  data  set  is  approximately  3 
or  4. 

To  simplify  the  design  of  the  classifier,  we  reduce  the  number  of 
features  to  two  by  employing  the  Foley-Sammon  method  [59]  described  in 
Section  5.2  of  Chapter  5.  Consequently,  each  training  sample  (or  each 
trial)  is  represented  by  two  features.  Then  these  two  features  are 
plotted  in  two  dimensional  space.  A  plot  of  all  the  training  samples  is 
called  a  scatter  plot.  Then  a  classifer,  a  line  which  separates  the  two 
classes  is  drawn.  This  classifier  is  used  to  classify  the  testing  data 
set  which  is  not  employed  in  the  design  stage  to  determine  the  holdout 
error  rate.  A  sample  figure  of  merit  table,  which  shows  the  figure  of 
merit  for  each  feature,  for  the  optimal  number  of  features  for  Channel  4 
and  6  is  shown  in  Table  8.3.  In  this  table,  the  label  "optimal  number 
of  features"  indicates  the  method  described  above  was  used  while  label 
"all  features"   indicates   a  method   that   used   all   the  features,    i.e.  25 
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Table  8.3      Figure  of  Merit 


Fiaher  Ratio   CALL  FEATURES  ARE  USED)     Clasas   TR  vs  FR 


SubjiPJO,   SeriP,   Seaaii,  S-LOCK   C2200   -  24003,     36  Samp/Cl, 

NO  OF  FEAT  1ST  VECTOR       2ND  VECTOR 

CHN.4       C4               25  8.86961  4.77280 

CHN16       Cr               25  8.52324  4.80850 


Fiaher  Ratio   (OPTIMAL  NUMBER  OF  FEATURES)     Claaa:   TR  va  FR 


Subj :PJO,  SerjP 

,  Seaail, 

S-LOCK  C2200 

-  240  0)  ,     36  Samp/Cl. 

HO 

. OF  FEAT 

1ST  VECTOR 

2ND  VECTOR 

CHN>4  C4 

1 

1.92600 

1.92600 

CHN16  Cz 

1 
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without  determining  the  optimal  number  of  features  (i.e.  apply  Foley- 
Sammon  method  directly  to  reduce  the  number  of  features  from  25  to  2). 
Figure  8.7  summarizes  the  procedure  for  both  methods.  A  sample  scatter 
plot  is  shown  in  Figure  8.8.  The  error  rate  of  both  design  and  test 
sets  for  S-locked,  R-locked  and  time  warping  for  response  sessions  are 
shown  in  Table  8.4,  8.5  and  8.6  respectively.  The  error  rate  for  no 
response  sessions  for  S-locked  is  shown  in  Table  8.7. 

8.4  Results 

From  Tables  8.4-8.7,  the  results  can  be  summarized  as  follows. 

(1)  The  error  rate,  when  the  optimum  number  of  features  is  used,  is 
on  the  average  better  than  when  all  features  are  used.  When  all 
features  are  used,  there  is  a  large  difference  between  the  error 
rate  of  the  design  and  test  sets.  This  result  verifies  the 
derivation  in  Chapter  7  and  confirms  the  existence  of  an  optimal 
number  of  features. 

(2)  For  the  sessions  when  the  subject  responds  with  a  finger  lift, 
the  S-locked  data  is  the  best. 

(3)  On  the  average,  the  response  sessions  have  lower  error  rates 
than  the  non-response  sessions.  (A  response  session  is  when  the 
subject  indicates  his  response  with  a  finger  lift.)  This  result 
agrees  with  [102]  which  indicated  that  the  response  session  shows  a 
larger  difference  than  the  non-response  session. 

(4)  The  error  rate  varies  form  subject  to  subject.  The  best 
subjects  are  PJO  for  response  sessions  and  J\T&  for  non-response 
sessions. 

(5)  Channel  4  (C4)  is  slightly  better  than  channel  6  (Cz). 
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Figure  8.7      A  summary  of  feature  reduction  techniques,   (a)  when 
optimal   number   of  features  is  used;   (b)  when  all 
features  are  used. 
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Figure  8.8        Scatter  plot;     (a)  for  design  set        (b)  for  test  set 
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TABLE  8.4    Number  of  Errors  for  the  True  Response 
(TR)  vs  False  Response  (FR)  S-Locked  (2200-2400),  72  Trials 


SUBJECT  CHN  4         C4  CHN  6  CZ 

OPT  ALL  OPT  ALL 


PJO 


VFH 


SMR 


DAT 


RHM 


JWB 


MPG 


JPM 


D  14  3  10  3 

T  17  22  17  22 

D  20  8  15  11 

T  27  29  34  30 

D  27  18  28  7 

T  30  40  30  43 

D  34  14  29  11 

T  26  25  28  30 

D  15  8  18  11 

T  26  24  29  26 

D  21  14  26  18 

T  31  29  25  29 

D  20  14  15  15 

T  26  29  36  38 

D  23  19  17  12 

T  41  37  35  39 


AVERAGE  D  21.75  12.25  19.75  11.00 

NUMBER 

OF  ERRORS  T  28.00  29.38  29.25  32.13 

ERROR  D  30.21  17.01  27.43  15.28 

RATE 

IN  %  T  38.89  40.80  40.63  44.62 
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TABLE  8.5    Number  of  Errors  for  True  Response  (TR)  vs. 
False  Response  (FR)  R-locked  (2500-2700),  72  Trials 

SUBJECT  CHN  4  C4  CHN  6  CZ 

OPT  ALL  OPT  ALL 


PJO 


VFH 


SMR 


DAT 


RHM 


JWB 


MPG 


JPM 


D  22  13  23  9 

T  26  33  41  42 

D  32  17  25  17 

T  33  40  40  40 

D  22  12  23  15 

T  38  34  39  40 

D  25  9  17  11 

T  38  34  32  34 

D  35  14  17  8 

T  37  31  38  36 

D  23  14  23  14 

T  36  44  47  45 

D  23  14  28  19 

T  34  31  34  26 

D  26  12  24  11 

T  36  42  32  38 


AVERAGE  D            26  13.13  22.50  13.00 

NUMBER 

OF  ERRORS  T  34.75  36.13  37.88  37.63 

ERROR  D  36.11  18.23  31.25  18.06 

RATE 

IN  %  T  48.26  50.17  52.60  52.26 
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TABLE  8.6    Number  of  Errors  for  True  Response 
(TR)  vs.  False  Response  (FR)  Time-Warping  (2400-2600),  72  Trials 


SUBJECT  CHN  4        C4  CHN  6  Cz 

OPT         ALL  OPT  ALL 


PJO 


VHF 


SMR 


DAT 


RHM 


JWB 


MPG 


JPM 


D  15  10  28  11 

T  32  30  33  29 

D  22  10  19  14 

T  37  36  39  33 

D  32  15  30  16 

T  32  29  34  29 

D  28  18  29  18 

T  28  35  34  32 

D  25  15  22  16 

T  32  41  40  37 

D  26  18  31  18 

T  43  38  33  36 

D  24  18  27  11 

T  37  38  38  36 

D  22  12  21  14 

T  32  42  35  33 


AVERAGE  D  24.25  14.50  25.08  14.75 

NUMBER 

OF  ERRORS  T  34.13  36.13  35.75  33.13 

ERROR  D  33.68  20.14  35.93  20.49 

RATE 

IN  %  T  47.40  50.17  49.65  46.01 
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TABLE  8.7    Number  of  Error  for  True  No-Response 
(TN)  vs.  False  No-Response  (FN)  S-Locked  (2200-2400),  72  Trials 


SUBJECT 


PJO 


VFH 


SMR 


DAT 


RHM 


JWB 


MPG 


JPM 


AVERAGE 
NUMBER 
OF  ERRORS 

ERROR 
RATE 
IN  % 


CHN4 

C4 

CHN  6 

GZ 

OPT 

ALL 

OPT 

ALL 

D 

30 

13 

36 

19 

T 

30 

35 

38 

34 

D 

27 

18 

24 

14 

T 

31 

35 

29 

34 

D 

23 

15 

32 

14 

T 

38 

35 

39 

28 

D 

29 

20 

30 

17 

T 

36 

40 

35 

35 

D 

30 

13 

24 

15 

T 

32 

33 

24 

33 

D 

19 

12 

28 

12 

T 

22 

23 

29 

22 

D 

21 

13 

19 

17 

T 

36 

34 

27 

D 

29 

22 

20 

17 

T 

29 

34 

39 

35 

D 

26.00 

15.75 

26.63 

15.63 

T 

32.13. 

33.88 

33.38 

31.00 

D 

36.11 

21.88 

36.98 

21.70 

T 

44.62 

47.05 

46.35 

43.06 

CHAPTER  9 
CONCLUDING  REMARKS  AND  FUTURE  WORK 


The  major  results  of  this  research  are  sunnnarized  as  follows.  1) 
The  derivation  of  a  relationship  between  the  optimal  number  of  features 
and  the  training  sample  size  for  the  two-class  Gaussian  data  case  with 
equal  covariance  matrices  where  means  are  unknown  as  well  as  the 
covariance  matrices.  2)  We  suggested  a  feature  selection  technique  and 
emphasized  the  importance  of  feature  ordering.  3)  We  studied  the  effect 
of  unequal  training  sample  size.  4)  We  applied  the  classification 
techniques  to  ERP  data  and  compared  different  data  alignment  techniques. 

Some  interesting  future  research  in  this  area  would  be  to  1)  study 
the  relationship  between  the  optimal  number  of  features  and  training 
sample  size  for  other  cases  of  Gaussian  data  with  unequal  covariance 
matrices,  2)  use  the  leave-one-out  or  Jack  Knife  method  to  estimate  the 
error  rate,  3)  find  some  new  data  alignment  technique,  for  example, 
non-linear  time  warping,  and  4)  use  some  technique  to  normalize  the  data 
to  decrease  the  variance. 
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